I am running DQD against an OMOP payload of about 500GB. I set -Xmx and -Xms to 64GB (I have 128GB RAM and a 64-core AMD processor on Azure), and RStudio shows both JVM parameters set to those values. That said, Task Manager shows the RStudio session consuming only 4GB while DQD runs, which is the default JVM heap size. Has anyone crafted a cookbook for performance tuning DQD for large payloads?
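One thing worth checking (this is general rJava/HADES behavior, not something I've confirmed against DQD specifically): rJava only reads `java.parameters` once, when the JVM starts. If DatabaseConnector or any other rJava-backed package has already been loaded, changing the option afterwards has no effect, which would explain the heap staying pinned at the default. A minimal sketch of the ordering that should apply:

```r
# Set the JVM heap BEFORE any rJava-dependent package is attached.
# rJava reads java.parameters only at JVM startup, so this line must
# run first in a fresh R session (restart R if the JVM is already up).
options(java.parameters = "-Xmx64g")

# Only now load the packages that initialize the JVM.
library(DatabaseConnector)
library(DataQualityDashboard)
```

After a fresh restart with this ordering, the RStudio process should be able to grow past 4GB.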
I also tried numThreads values of 2, 4, 8, 12, and 16 (restarting RStudio each time). With anything other than 1, the session appears to freeze, with a percent indicator in the lower right-hand corner of the log window that never gets past 27%. I'm not even sure what that indicator is telling me, frankly. Below is my code for numThreads = 1, which does make progress but takes a very long time to complete.
DataQualityDashboard::executeDqChecks(
  connectionDetails = connectionDetails,     # Your connection details
  cdmDatabaseSchema = "OMOP_PROD.dbo",       # Schema where OMOP data is stored
  vocabDatabaseSchema = "OMOP_DEV.dbo",      # Schema where the vocabulary is stored
  verboseMode = TRUE,
  resultsDatabaseSchema = "OMOP_PROD.dbo",   # Schema for storing results
  cdmSourceName = "OMOP_PROD",               # Friendly name for the CDM database
  numThreads = 1,                            # Number of threads to use
  outputFolder = "c:/DQDOutput/OMOP_PROD",   # Folder to save output files
  outputFile = "output.json",
  cdmVersion = "5.4",                        # OMOP CDM version
  tablesToExclude = c(
    "CONCEPT", "VOCABULARY", "CONCEPT_ANCESTOR", "CONCEPT_RELATIONSHIP",
    "CONCEPT_CLASS", "CONCEPT_SYNONYM", "DRUG_STRENGTH", "RELATIONSHIP",
    "DOMAIN", "COST", "FACT_RELATIONSHIP", "DOSE_ERA", "DRUG_ERA",
    "NOTE_NLP", "EPISODE", "OBSERVATION_PERIOD", "EPISODE_EVENT",
    "CONDITION_ERA", "COHORT", "COHORT_DEFINITION", "METADATA",
    "CDM_SOURCE", "NOTE", "SPECIMEN"
    #"CARE_SITE",
    #"LOCATION",
    #"PERSON",
    #"CONDITION_OCCURRENCE",
    #"DEATH",
    #"DEVICE_EXPOSURE",
    #"DRUG_EXPOSURE",
    #"OBSERVATION",
    #"PAYER_PLAN_PERIOD",
    #"PROCEDURE_OCCURRENCE",
    #"PROVIDER",
    #"VISIT_DETAIL",
    #"VISIT_OCCURRENCE",
    #"MEASUREMENT"
  )
)