Error in running Random Forest using PatientLevelPrediction

Alonso · August 23, 2023, 4:18pm

Hello OHDSI Community,

I am currently facing an issue while attempting to run a Random Forest using the PatientLevelPrediction package. I have provided below the essential parts of my code for context:

library(FeatureExtraction)
library(PatientLevelPrediction)
library(DatabaseConnector)
library(ggplot2)
library(reticulate)
#py_install("scipy")
#py_install("joblib")
#py_install("scikit-learn")


# Read the .Renviron file
readRenviron("../../config/.Renviron")

# Retrieve environment variables
db_server = Sys.getenv("SERVER")
db_user = Sys.getenv("USER")
db_password = Sys.getenv("PASSWORD")
db_path_to_driver = Sys.getenv("PATH_TO_DRIVER")

# Create connection details
connectionDetails = createConnectionDetails(
  dbms = "postgresql",
  server = db_server,
  user = db_user,
  password = db_password,
  pathToDriver = db_path_to_driver)

# Set database schemas and version
cdmDatabaseSchema <- "cdm"
cohortsDatabaseSchema <- "cohorts"
cdmVersion <- "5"
outcomeDatabaseSchema = "cohorts"

# Create database details
databaseDetails <- createDatabaseDetails(
  connectionDetails = connectionDetails,
  cdmDatabaseSchema = cdmDatabaseSchema,
  cdmDatabaseName = 'OMOP CDM',
  cdmDatabaseId = "1",
  cohortDatabaseSchema = cohortsDatabaseSchema, 
  cohortTable = 'mycohort',
  targetId = 7,
  outcomeDatabaseSchema = outcomeDatabaseSchema,
  outcomeTable = 'mycohort', 
  outcomeIds = c(6),
  cdmVersion = 5
)

# Create restrictPlpDataSettings
restrictPlpDataSettings <- createRestrictPlpDataSettings()

# Create study population settings
# predict outcome within 0 to 365 days after index
populationSettings = createStudyPopulationSettings(
  binary = TRUE,
  washoutPeriod = 0,
  firstExposureOnly = FALSE,
  removeSubjectsWithPriorOutcome = FALSE,
  priorOutcomeLookback = 99999,
  riskWindowStart = 0,
  riskWindowEnd = 365,
  startAnchor = 'cohort start',
  endAnchor = 'cohort start',
  minTimeAtRisk = 364,
  requireTimeAtRisk = FALSE,
  includeAllOutcomes = TRUE )

# Create covariate settings
# use age/gender in groups and measurements as features
covariateSettings = createCovariateSettings(
  useDemographicsGender = TRUE, 
  useDemographicsAge = TRUE,
  useDemographicsAgeGroup = TRUE,
  useMeasurementValueShortTerm = TRUE,
  useMeasurementValueMediumTerm = TRUE,
  shortTermStartDays = -5,
  mediumTermStartDays = -10,
  endDays = 0)

# Define the settings for data splitting
splitSettings = createDefaultSplitSetting(
  trainFraction = 0.75,
  testFraction = 0.25,
  type = 'stratified',
  nfold = 3,
  splitSeed = 1234
)

# Define the preprocessing settings
preprocessSettings <- createPreprocessSettings(
  minFraction = 0.01,
  normalize = TRUE,
  removeRedundancy = TRUE
)

# Get PLP data
plpData = getPlpData(
  databaseDetails = databaseDetails,
  covariateSettings = covariateSettings,
  restrictPlpDataSettings = restrictPlpDataSettings )

results = runPlp(
  plpData = plpData,
  outcomeId = 6, 
  analysisId = 2,
  analysisName = 'Random Forest Model for Mortality Prediction',
  logSettings = createLogSettings(),
  populationSettings = populationSettings, 
  featureEngineeringSettings = createRandomForestFeatureSelection(),
  sampleSettings = createSampleSettings(), 
  splitSettings = splitSettings, 
  preprocessSettings = preprocessSettings, 
  modelSettings = setRandomForest(),
  executeSettings = createExecuteSettings(
    runSplitData = TRUE,
    runSampleData = TRUE,
    runfeatureEngineering = TRUE,
    runPreprocessData = TRUE,
    runModelDevelopment = TRUE,
    runCovariateSummary = TRUE),
  saveDirectory = file.path(getwd(), 'model')
)

Upon executing the runPlp function, I encountered the following error:

Error: ValueError: could not assign tuple of length 8 to structure with 7 fields.


Error in rbind(deparse.level, ...) : 
  numbers of columns of arguments do not match

The Random Forest model is mostly using default parameters. Any guidance or insights into resolving this issue would be highly appreciated. Thank you for your assistance!

Best regards,
Alonso

plpLog.txt (269.0 KB)

Alonso · August 24, 2023, 8:36pm

Quick update on my previous post: switching the model to setLassoLogisticRegression() works. However, I would still like to understand why setRandomForest() fails with the error above.

egillax · August 28, 2023, 11:57am

Hi Alonso,

This issue occurs with the latest scikit-learn (1.3.0). It will be patched in https://github.com/OHDSI/PatientLevelPrediction/pull/410, which should be released this week. I'm just waiting on @jreps to review my patch.

In the meantime, you could either downgrade scikit-learn or install PLP from the patch branch with:

remotes::install_github('ohdsi/PatientLevelPrediction@406-hades_weekly_fail')
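
For the downgrade route, it may help to first check which scikit-learn version reticulate actually loads. The following is only a minimal sketch: both helper names are hypothetical, the 1.3.0 threshold is simply the version mentioned above, and the check assumes a plain release version string such as "1.2.2".

```r
# Hypothetical helper: TRUE when a scikit-learn version string is at or
# above 1.3.0, the release mentioned above as triggering the error.
# Assumes a plain release version such as "1.2.2" (no dev/rc suffix).
isAffectedSklearn <- function(version) {
  package_version(version) >= package_version("1.3.0")
}

# Hypothetical helper: read the version of the scikit-learn module that
# reticulate loads from the active Python environment.
installedSklearnVersion <- function() {
  reticulate::import("sklearn")$`__version__`
}
```

Calling isAffectedSklearn(installedSklearnVersion()) should then indicate whether a downgrade is relevant. If it is, something like reticulate::py_install("scikit-learn==1.2.2", pip = TRUE) would pin an earlier release; 1.2.2 is only an example of a version below 1.3.0.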

Regards,
Egill

egillax · August 28, 2023, 12:49pm

Hi again @Alonso,

The patch has just been merged, so it should work now.

Egill

Alonso · August 29, 2023, 6:15pm

Thanks @egillax, for your prompt response and helpful guidance. My issue has been successfully resolved.