Hello OHDSI Community,
I am currently facing an issue while attempting to run a Random Forest using the PatientLevelPrediction package. I have provided below the essential parts of my code for context:
library(FeatureExtraction)
library(PatientLevelPrediction)
library(DatabaseConnector)
library(ggplot2)
library(reticulate)
#py_install("scipy")
#py_install("joblib")
#py_install("scikit-learn")
# Read the .Renviron file
readRenviron("../../config/.Renviron")
# Retrieve environment variables
db_server = Sys.getenv("SERVER")
db_user = Sys.getenv("USER")
db_password = Sys.getenv("PASSWORD")
db_path_to_driver = Sys.getenv("PATH_TO_DRIVER")
# Create connection details
connectionDetails = createConnectionDetails(
dbms = "postgresql",
server = db_server,
user = db_user,
password = db_password,
pathToDriver = db_path_to_driver)
# Set database schemas and version
cdmDatabaseSchema <- "cdm"
cohortsDatabaseSchema <- "cohorts"
cdmVersion <- "5"
outcomeDatabaseSchema = "cohorts"
# Create database details
databaseDetails <- createDatabaseDetails(
connectionDetails = connectionDetails,
cdmDatabaseSchema = cdmDatabaseSchema,
cdmDatabaseName = 'OMOP CDM',
cdmDatabaseId = "1",
cohortDatabaseSchema = cohortsDatabaseSchema,
cohortTable = 'mycohort',
targetId = 7,
outcomeDatabaseSchema = outcomeDatabaseSchema,
outcomeTable = 'mycohort',
outcomeIds = c(6),
cdmVersion = 5
)
# Create restrictPlpDataSettings
restrictPlpDataSettings <- createRestrictPlpDataSettings()
# Create study population settings
# Predict the outcome within 0 to 365 days after index
populationSettings = createStudyPopulationSettings(
binary = TRUE,
washoutPeriod = 0,
firstExposureOnly = FALSE,
removeSubjectsWithPriorOutcome = FALSE,
priorOutcomeLookback = 99999,
riskWindowStart = 0,
riskWindowEnd = 365,
startAnchor = 'cohort start',
endAnchor = 'cohort start',
minTimeAtRisk = 364, # only enforced when requireTimeAtRisk = TRUE
requireTimeAtRisk = FALSE,
includeAllOutcomes = TRUE)
# Create covariate settings
# Use gender, age, age group, and short/medium-term measurement values as features
covariateSettings = createCovariateSettings(
useDemographicsGender = TRUE,
useDemographicsAge = TRUE,
useDemographicsAgeGroup = TRUE,
useMeasurementValueShortTerm = TRUE,
useMeasurementValueMediumTerm = TRUE,
shortTermStartDays = -5,
mediumTermStartDays = -10,
endDays = 0)
# Define the settings for data splitting
splitSettings = createDefaultSplitSetting(
trainFraction = 0.75,
testFraction = 0.25,
type = 'stratified',
nfold = 3,
splitSeed = 1234
)
# Define the preprocess settings
preprocessSettings <- createPreprocessSettings(
minFraction = 0.01,
normalize = TRUE,
removeRedundancy = TRUE
)
# Get PLP data
plpData = getPlpData(
databaseDetails = databaseDetails,
covariateSettings = covariateSettings,
restrictPlpDataSettings = restrictPlpDataSettings)
# Run the prediction pipeline
results = runPlp(
plpData = plpData,
outcomeId = 6,
analysisId = 2,
analysisName = 'Random Forest Model for Mortality Prediction',
logSettings = createLogSettings(),
populationSettings = populationSettings,
featureEngineeringSettings = createRandomForestFeatureSelection(),
sampleSettings = createSampleSettings(),
splitSettings = splitSettings,
preprocessSettings = preprocessSettings,
modelSettings = setRandomForest(),
executeSettings = createExecuteSettings(
runSplitData = TRUE,
runSampleData = TRUE,
runfeatureEngineering = TRUE,
runPreprocessData = TRUE,
runModelDevelopment = TRUE,
runCovariateSummary = TRUE),
saveDirectory = file.path(getwd(), 'model')
)
Upon executing the runPlp function, I encountered the following error:
Error: ValueError: could not assign tuple of length 8 to structure with 7 fields.
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
The Random Forest model uses the default parameters (setRandomForest() is called with no arguments). Any guidance on resolving this issue would be highly appreciated. Thank you for your assistance!
Best regards,
Alonso
plpLog.txt (269.0 KB)