CohortDiagnostics - parallelizing

Ajit_Londhe · October 24, 2023, 1:08pm

@jpegilbert and CohortDiagnostics team: do you feel the package has any potential for parallelization (using ParallelLogger::clusterApply)?

I ask because we’ve seen some slow performance on our end in Redshift, and from what I can tell, the various jobs run in serial.

jpegilbert · October 26, 2023, 7:29pm

Hi @Ajit_Londhe - we have tried some of this in the past, but sending multiple simultaneous queries to Redshift has diminishing returns because you’re ultimately offloading the work to the database engine, which quickly becomes the bottleneck (e.g. executing 20 cohorts simultaneously is unlikely to give a big performance gain).
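
To make the diminishing-returns point concrete, here is a rough back-of-the-envelope sketch in Python. It is a toy model, not a measurement of Redshift or of CohortDiagnostics: the function name `estimated_wall_time`, the idea of a fixed number of database "slots", and all the numbers (20 queries, 60 seconds each, 5 slots) are assumptions made up for illustration. It also assumes every query takes the same time and that queries do not slow each other down, which is optimistic.

```python
import math


def estimated_wall_time(n_queries, seconds_per_query, client_workers, db_slots):
    """Toy estimate of total wall time for a batch of equal-length queries.

    Assumption (hypothetical model): the database runs at most `db_slots`
    queries at once, so extra client workers beyond that just wait in a queue.
    """
    if min(n_queries, client_workers, db_slots) < 1:
        raise ValueError("n_queries, client_workers and db_slots must be >= 1")
    # Only as many queries run at once as BOTH the client and the database allow.
    effective_concurrency = min(client_workers, db_slots)
    # Queries are processed in rounds of `effective_concurrency` at a time.
    rounds = math.ceil(n_queries / effective_concurrency)
    return rounds * seconds_per_query


if __name__ == "__main__":
    # Made-up scenario: 20 cohort queries, 60 s each, database handles 5 at once.
    for workers in (1, 2, 5, 10, 20):
        total = estimated_wall_time(
            n_queries=20, seconds_per_query=60, client_workers=workers, db_slots=5
        )
        print(f"{workers:>2} client workers: about {total} s")
```

Under these made-up numbers, the estimate stops improving once the number of client workers reaches the number of database slots: 5, 10 and 20 workers all give the same result. On a real cluster, concurrent queries also compete for memory and I/O, so the gain is likely smaller than even this model suggests.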

I do have plans to optimize some things in a CohortDiagnostics 4.0.0 release; perhaps we can set up some time in the next few weeks to discuss priorities on this. I would like to split out the execution of different tasks to make it easier to execute just certain parts of the package, but the testing is currently in a state that makes changes like this challenging.