
Debugging speed issue with getCohortMethodData

Hi OHDSI community,

I’m a relatively new user, and I have been using CohortMethod to implement a target-comparator analysis on a PostgreSQL database. I have been encountering extremely slow speeds running the getCohortMethodData function - the last run took 21 days to complete! Is there anything I could debug to fix this, or is this expected when there is a significant number of patients?

Thank you!

Hi Jennifer - how many patients are in your database? And what kind of data is it?

Just to clarify - the reason for my question is that, very frequently, the cause of slow speed and poor performance is the infrastructure itself not being properly sized:

  • ensure PostgreSQL takes advantage of fine-tuning via cache, concurrency, and other parameters (see the sketch after this list)
  • more RAM and enough disk
  • use SSDs, especially if it is swapping to disk
  • create indexes and compute statistics
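
To make the first and last points concrete, here is a minimal sketch of how you might inspect and adjust a few memory-related parameters and refresh the planner statistics. The parameter names are standard PostgreSQL settings, but the values shown are only placeholders - they depend entirely on the RAM and CPU of your server, and ALTER SYSTEM requires superuser rights:

    -- check current values
    show shared_buffers;
    show work_mem;
    show effective_cache_size;

    -- example adjustments (placeholder values; size them to your hardware)
    alter system set shared_buffers = '8GB';
    alter system set work_mem = '256MB';
    alter system set effective_cache_size = '24GB';
    alter system set maintenance_work_mem = '1GB';
    select pg_reload_conf();  -- note: shared_buffers only takes effect after a server restart

    -- refresh planner statistics on all tables you can access
    analyze verbose;

work_mem in particular affects the large joins and sorts that getCohortMethodData generates, but setting it too high on a shared cluster can exhaust memory, so treat these as starting points to discuss with your database administrator.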

Most importantly, set the right expectations: PostgreSQL is not built for very large data sets. It should work well - assuming correct infrastructure sizing and optimization as per above - for decent-sized data sets, but it will not do too well with extremely large ones. This might change in the future, as it is definitely being enhanced with additional scalability features.

Hi Greg,

Thanks for getting back to me. I’m using a PostgreSQL database in the OMOP v5 format. I have access to both a smaller and a larger dataset. In the smaller database, the cohort sizes are ~1K and ~1.2K for the target and comparator groups. I was able to run getCohortMethodData on this database successfully, but the full execution took 21 days. I also have access to a larger database where my cohorts are ~26K and ~33K patients respectively. So quite big! And I’d like to avoid that analysis taking 3 weeks if possible. My original hypothesis was that it might have something to do with the feature extraction - perhaps there’s a dependency that I need to update?

I’m also accessing the database on a shared computing cluster, so I don’t think I can increase the RAM, but it should have enough disk to complete the queries.

Thanks again for your thoughts, and if others have ideas, I’d very much appreciate it.

As @gregk said, it could be many issues, more so when you factor in a shared resource. I would start by verifying that the indexes have been created. In the CDM project they are in CommonDataModel/PostgreSQL/OMOP\ CDM\ postgresql\ indexes.txt. You can look for them in Postgres by querying the pg_indexes view. Adjust this for your schema name:

select tablename, indexname, indexdef from pg_indexes where schemaname = 'cdm';
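
If any of the indexes from that DDL file are missing, running the script will usually be the single biggest win. It can also be worth checking when the planner statistics were last refreshed - here is a sketch of that check, again assuming your CDM tables live in a schema called cdm:

    select relname, n_live_tup, last_analyze, last_autoanalyze
    from pg_stat_user_tables
    where schemaname = 'cdm'
    order by n_live_tup desc;

    -- if the analyze columns are NULL or very old, refresh statistics on the big tables, e.g.:
    analyze cdm.person;
    analyze cdm.drug_exposure;
    analyze cdm.condition_occurrence;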

-Chris
