JVM settings to optimize DQD in RStudio

I am running DQD on an OMOP payload of about 500GB. I am setting -Xmx and -Xms to 64GB (I have 128GB RAM and a 64-core AMD processor on Azure). RStudio shows both JVM parameters set to those values. That said, Task Manager shows the RStudio session consuming only 4GB while DQD runs, which is the default JVM heap. Has anyone crafted a cookbook for performance tuning DQD for large payloads?
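One thing worth checking (a sketch, assuming DQD's rJava backend): rJava reads the `java.parameters` option only when the JVM first starts, so the heap flags must be set before any JVM-dependent package loads. Setting them afterwards silently leaves the default heap, which would match the ~4GB you're seeing.

```r
# Set the JVM heap BEFORE any package initializes the JVM, at the very top
# of a fresh R session:
options(java.parameters = c("-Xmx64g", "-Xms64g"))

# Only now load the packages that start the JVM:
# library(DataQualityDashboard)

# Optional sanity check once rJava is up: maxMemory() should report a value
# near the -Xmx setting rather than the ~4GB default:
# rJava::.jinit()
# rJava::J("java.lang.Runtime")$getRuntime()$maxMemory() / 1024^3
```

If Task Manager still shows ~4GB, the usual culprit is that the JVM was already initialized (e.g. by a previously loaded package) before `options()` ran; restarting the R session and setting the option first is the fix.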

I also set numThreads to 2, 4, 8, 12, and 16 (restarting RStudio each time). With any numThreads value other than 1, the session appears to freeze, with a percent indicator in the lower right-hand corner of the log window that never gets higher than 27%. I'm not even sure what that indicator is telling me, frankly. Below is the code for numThreads = 1, which does progress but takes a very long time to complete.

DataQualityDashboard::executeDqChecks(
  connectionDetails = connectionDetails,     # Your connection details
  cdmDatabaseSchema = "OMOP_PROD.dbo",       # Schema where OMOP data is stored
  vocabDatabaseSchema = "OMOP_DEV.dbo",      # Schema where the vocabulary is stored
  verboseMode = TRUE,
  resultsDatabaseSchema = "OMOP_PROD.dbo",   # Schema for storing results
  cdmSourceName = "OMOP_PROD",               # Friendly name for the CDM database
  numThreads = 1,                            # Number of threads to use
  outputFolder = "c:/DQDOutput/OMOP_PROD",   # Folder to save output files
  outputFile = "output.json",
  cdmVersion = "5.4",                        # OMOP CDM version

  tablesToExclude = c(
    "CONCEPT", "VOCABULARY", "CONCEPT_ANCESTOR", "CONCEPT_RELATIONSHIP",
    "CONCEPT_CLASS", "CONCEPT_SYNONYM", "DRUG_STRENGTH", "RELATIONSHIP",
    "DOMAIN", "COST", "FACT_RELATIONSHIP", "DOSE_ERA", "DRUG_ERA",
    "NOTE_NLP", "EPISODE", "OBSERVATION_PERIOD", "EPISODE_EVENT",
    "CONDITION_ERA", "COHORT", "COHORT_DEFINITION", "METADATA",
    "CDM_SOURCE", "NOTE", "SPECIMEN"

    # "CARE_SITE",
    # "LOCATION",
    # "PERSON",
    # "CONDITION_OCCURRENCE",
    # "DEATH",
    # "DEVICE_EXPOSURE",
    # "DRUG_EXPOSURE",
    # "OBSERVATION",
    # "PAYER_PLAN_PERIOD",
    # "PROCEDURE_OCCURRENCE",
    # "PROVIDER",
    # "VISIT_DETAIL",
    # "VISIT_OCCURRENCE",
    # "MEASUREMENT"
  )
)
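As a side note, if the goal is to find out which checks are slow in the warehouse, executeDqChecks has a sqlOnly parameter that writes the generated SQL to the output folder instead of running it, so the queries can be profiled directly in the database. A sketch (connectionDetails as defined above is assumed; wrapped in `if (FALSE)` since it needs a live database configuration):

```r
# Sketch: emit the check SQL to disk for profiling in the warehouse,
# rather than executing the checks from R.
if (FALSE) {
  DataQualityDashboard::executeDqChecks(
    connectionDetails = connectionDetails,
    cdmDatabaseSchema = "OMOP_PROD.dbo",
    resultsDatabaseSchema = "OMOP_PROD.dbo",
    cdmSourceName = "OMOP_PROD",
    outputFolder = "c:/DQDOutput/OMOP_PROD/sql",
    sqlOnly = TRUE  # write .sql files to outputFolder instead of running them
  )
}
```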

Hi @wjmetcalfiii thanks for your message. A couple pieces of info that could be helpful:

  • All of the heavy computation in DQD is occurring via SQL in your database platform. So if you’re having performance issues the first place I’d check would be your warehouse. You can inspect long-running queries in the query logs, and if you’re getting stuck specifically in multithreaded mode I would look into whether your database is having an issue with the parallel execution.
  • In case it is some issue in R, this is the package that’s used to kick off the parallel threads: GitHub - OHDSI/ParallelLogger: An R package for easy parallel computing, logging, and function call automation.
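To rule out an R-side problem, the threading machinery can be exercised on its own, independent of DQD and the database. A minimal sketch, assuming ParallelLogger is installed:

```r
# Minimal check that ParallelLogger can spin up worker processes on this
# machine and get results back:
cluster <- ParallelLogger::makeCluster(numberOfThreads = 4)
results <- ParallelLogger::clusterApply(cluster, 1:8, function(x) x^2)
ParallelLogger::stopCluster(cluster)
unlist(results)  # the squares 1..64 if the workers are healthy
```

If this hangs the way DQD does at numThreads > 1, the issue is in the R/parallel setup (firewall, worker startup, etc.) rather than in the database.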

Hope that helps as a starting point; please let me know your findings and any further questions :smiley: