Has anyone implemented database partitioning schemes on the omop CDM tables ( condition_occurrence, etc )? I am running into some performance issues with a large dataset.
Thanks,
Richard
OHDSI Home | Forums | Wiki | Github |
Has anyone implemented database partitioning schemes on the omop CDM tables ( condition_occurrence, etc )? I am running into some performance issues with a large dataset.
Thanks,
Richard
we partition by patient
sorry, I meant groups of patients – we sharded our database. depending on server and memory the size of the shards might vary.
We are running the microsoft parallel data warehouse in our environment so your mileage may vary but I concur with @Mark_Danese, we do our distributions by person_id which would be a partition strategy that I would recommend for a traditional relational environment as well. I have done performance testing in a number of environments and person_id was the winning strategy in each. Many analyses join data across different tables on person_id so partitioning on person_id tends to improve most workloads.
Thanks, that does seem to be the best way. I had some concerns about “where diag = xxx” type of queries, but so many things are based around the cohort and person_id.
Would compiling a technical description of systems and sizing information be useful to help newcomers determine their hardware/software needs? I imagine there is a wide range of solutions people have developed
Thanks for your help!