(I was a little unsure where to post this but opted for the Researchers category over Implementers, as to my mind this is primarily about evidence generation.)
Imagine you have two similar databases. I won’t define what I mean by similar, but for example two UK primary care EHR databases (let’s call them A and G) collected from systems produced by two different GP EHR software vendors. You are going to OMOP these.
I am interested in views from the community on pros and cons of two approaches:
- Option 1: create 2 separate OMOP databases
- Option 2: create a single pooled OMOP database
My clear preference is option 1 - partly based on my understanding of the OMOP CDM and OHDSI tools and approach, but also partly instinctive. So for balance I have tried to set out arguments for both, and invite others to weigh in on either side, correct my misunderstandings, identify technical challenges etc.
ETL process: Option 1 is simple: separate ETLs for each database. The ‘similarity’ of the native databases (e.g. data models, vocabularies) determines how much ETL code can be shared. For option 2 a simple approach would require separate ETL for each data source, and additional steps to combine the two at the end. A third option might be to adopt option 1 and use views to represent a virtual pooled version.
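To make the "virtual pooled version" idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. Everything in it is an assumption for illustration: the table names a_person and g_person, the pooled_person view, the source_db column (not part of the CDM) and the ID_OFFSET value are all hypothetical, and a real CDM would need the same treatment for every table and every key that references person_id.

```python
import sqlite3

# Hypothetical offset that keeps person_id values from the two sources distinct.
ID_OFFSET = 1_000_000_000

conn = sqlite3.connect(":memory:")

# Two toy PERSON tables standing in for the separate OMOP databases A and G.
conn.executescript("""
    CREATE TABLE a_person (person_id INTEGER PRIMARY KEY, year_of_birth INTEGER);
    CREATE TABLE g_person (person_id INTEGER PRIMARY KEY, year_of_birth INTEGER);
    INSERT INTO a_person VALUES (1, 1950), (2, 1980);
    INSERT INTO g_person VALUES (1, 1965), (2, 1990), (3, 2001);
""")

# The view stacks both tables, shifts the G identifiers so they cannot collide
# with the A identifiers, and records which source each row came from.
conn.execute(f"""
    CREATE VIEW pooled_person AS
    SELECT person_id, year_of_birth, 'A' AS source_db FROM a_person
    UNION ALL
    SELECT person_id + {ID_OFFSET}, year_of_birth, 'G' AS source_db FROM g_person
""")

rows = conn.execute(
    "SELECT person_id, year_of_birth, source_db FROM pooled_person ORDER BY person_id"
).fetchall()
for row in rows:
    print(row)
```

Keeping a source label in the view means a pooled query can still be split by database afterwards, which is one way a view-based approach might preserve the flexibility of Option 1.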
ETL fidelity and data quality assessment: Option 1 is more straightforward - there is a 1:1 relationship between source and OMOP databases. Option 2 requires additional steps to identify which source any given issue arises from.
Speed / performance: is it quicker or slower to run two analyses serially on smaller databases than a single run on one large database? I suspect this depends a lot on the nature of the queries, among other things. Separate analyses could potentially run partly in parallel, but would need an additional step to combine results. What would performance be like for a virtual pooled version?
Flexibility: Option 1 provides the greatest flexibility: analyses can be conducted in either or both data sources. Option 2 only allows for analysis of both together (at least using standard OHDSI analytics).
Assessing database heterogeneity: This feels like a very important one. It’s almost foundational in OHDSI that asking the same question in many different databases yields slightly (sometimes wildly) different answers. Conceptually at least, some heterogeneity is explainable (desirable even) - because the data sources cover different care settings, different populations, different national health systems. Data capture methods, accuracy of ETL and vocabulary mapping are potential sources of less ‘explainable’ heterogeneity. An essential step in evidence generation for OHDSI studies is therefore presenting and exploring heterogeneity, and where appropriate we may combine results in a meta-analysis. From this perspective Option 1, which affords the option to compare results from each database prior to pooling / meta-analysing the results, seems preferable. Option 2 seems to make a strong assumption that there are no differences between the component data sources important enough to be worth exploring.
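As a rough illustration of what "exploring heterogeneity" can mean numerically under Option 1, the sketch below computes Cochran's Q and the I² statistic from per-database estimates. The function name and the input numbers are invented for illustration, not results from any real database, and with only two databases these statistics are very imprecise, so they are a prompt for looking closer rather than a decision rule.

```python
def heterogeneity(estimates, std_errors):
    """Return (fixed-effect pooled estimate, Cochran's Q, I-squared).

    estimates:  per-database effect estimates on an additive scale
                (e.g. log hazard ratios)
    std_errors: the matching standard errors
    """
    # Inverse-variance weights: more precise databases count for more.
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)

    # Q sums the weighted squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, estimates))

    # I-squared is the share of Q beyond its degrees of freedom, floored at 0.
    df = len(estimates) - 1
    i_squared = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, q, i_squared


# Invented log hazard ratios for databases A and G, for illustration only.
pooled, q, i_squared = heterogeneity([0.0, 1.0], [0.5, 0.5])
print(pooled, q, i_squared)
```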
Combining data/results from different databases: Where between-database heterogeneity is not too great, we can combine data or results from different databases to increase study power or precision. A two-stage meta-analysis combines aggregated results (e.g. effect estimates, or numerators and denominators for incidence rates) from each database. This approach works with Option 1. A one-stage analysis pools patient-level data prior to analysis, and is possible only with Option 2 (using standard OHDSI analytics). Both approaches can account for heterogeneity and in most circumstances should give similar results, but one-stage analysis is generally preferred where feasible. So Option 2 wins here out of the box, though the trade-off is the inability to assess between-database heterogeneity. I also imagine that with some tinkering in R, or by using views to combine CDM tables, it should be possible to do a one-stage analysis with Option 1 as well.
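The difference between the two routes can be sketched for a simple incidence rate. This is a deliberately simplified illustration with hypothetical function names and invented counts: the two-stage version is a fixed-effect inverse-variance pool of log rates (using the usual approximation that the variance of a log rate is 1/events), and the "one-stage" version is just crude pooling of events and person-time, with no database term. A real one-stage analysis would normally model the database as a stratum or random effect.

```python
import math


def two_stage_rate(events, person_years):
    """Pool per-database log incidence rates with inverse-variance weights."""
    # Stage 1: each database reports its own log rate.
    log_rates = [math.log(d / t) for d, t in zip(events, person_years)]
    # Approximate variance of a log rate is 1/events, so the weight is events.
    weights = [float(d) for d in events]
    # Stage 2: weighted average of the log rates, back-transformed to a rate.
    pooled_log = sum(w * lr for w, lr in zip(weights, log_rates)) / sum(weights)
    return math.exp(pooled_log)


def one_stage_rate(events, person_years):
    """Crude pooled rate: total events over total person-time."""
    return sum(events) / sum(person_years)


# Invented counts for databases A and G, for illustration only.
events = [40, 90]
person_years = [1000.0, 2000.0]
print(two_stage_rate(events, person_years))
print(one_stage_rate(events, person_years))
```

When the per-database rates are close, the two numbers should be close as well; they drift apart as the rates diverge, which is itself a hint that pooling may not be appropriate.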
I’d be really interested to hear views on the above, and on related questions such as:
Is it ever appropriate to create a single OMOP database from two or more different data sources, and if so, under what circumstances?
Are there examples where something similar has been done?
Does creating a virtual combined database using views seem a viable option?