Combining two or more data sources in a single OMOP database: pros and cons

(I was a little unsure where to post this but opted for the Researchers category over Implementers, as to my mind this is primarily about evidence generation.)

Imagine you have two similar databases. I won’t define what I mean by similar, but for example two UK primary care EHR databases (let’s call them A and G) collected from systems produced by two different GP EHR software vendors. You are going to OMOP these.

I am interested in views from the community on pros and cons of two approaches:

  • create 2 separate OMOP databases
  • create a single pooled OMOP database

My clear preference is option 1 - partly based on my understanding of the OMOP CDM and OHDSI tools and approach, but also partly instinctive. So for balance I have tried to set out arguments for both, and invite others to weigh in on either side, correct my misunderstandings, identify technical challenges etc.

ETL process: Option 1 is simple: separate ETLs for each database. The ‘similarity’ of the native databases (e.g. data models, vocabularies) determines how much ETL code can be shared. For option 2 a simple approach would require separate ETL for each data source, and additional steps to combine the two at the end. A third option might be to adopt option 1 and use views to represent a virtual pooled version.
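The "views" idea in the last sentence can be sketched quite compactly. Below is a toy illustration in Python with SQLite standing in for real CDM databases: two separately ETL'd databases are exposed as one pooled database through a UNION ALL view. The table is a simplified stand-in for the CDM person table, and the ID-offset scheme is my own illustrative assumption, not an OHDSI convention:

```python
# Sketch of Option 1 plus virtual pooling: two separate "OMOP" databases
# (db_a, db_g) exposed as one via a UNION ALL view. Table columns are
# simplified stand-ins for the real CDM person table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS db_a")
conn.execute("ATTACH DATABASE ':memory:' AS db_g")

for schema in ("db_a", "db_g"):
    conn.execute(
        f"CREATE TABLE {schema}.person (person_id INTEGER, year_of_birth INTEGER)"
    )

conn.execute("INSERT INTO db_a.person VALUES (1, 1980), (2, 1975)")
conn.execute("INSERT INTO db_g.person VALUES (1, 1990)")  # same local id as in db_a!

# The view namespaces person_id by source so pooled ids cannot collide.
# (A TEMP view is used because SQLite lets temp objects reference any
# attached database.)
OFFSET = 10_000_000  # assumed to exceed any per-source person_id
conn.execute(f"""
    CREATE TEMP VIEW pooled_person AS
    SELECT person_id AS person_id, year_of_birth, 'A' AS source FROM db_a.person
    UNION ALL
    SELECT person_id + {OFFSET}, year_of_birth, 'G' FROM db_g.person
""")

rows = conn.execute(
    "SELECT person_id, source FROM pooled_person ORDER BY person_id"
).fetchall()
print(rows)  # [(1, 'A'), (2, 'A'), (10000001, 'G')]
```

Keeping a `source` column in the view also preserves the ability to stratify any pooled analysis by source database later.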

ETL fidelity and data quality assessment: Option 1 is more straightforward - there is a 1:1 relationship between source and OMOP databases. Option 2 requires additional steps to identify from which source any issue arises.

Speed / performance: i.e. is it quicker or slower to run two analyses serially on smaller databases, compared to a single run on a large database? I suspect this depends a lot on the nature of the queries, among other things. Separate analyses could potentially run partly in parallel, but would need an additional step of combining results. What would performance be like for a virtual pooled version?

Flexibility: Option 1 provides the greatest flexibility: analyses can be conducted in either or both data sources. Option 2 only allows analysis of the combined data (at least using standard OHDSI analytics).

Assessing database heterogeneity: This feels like a very important one. It’s almost foundational in OHDSI that asking the same question in many different databases yields slightly (sometimes wildly) different answers. Conceptually at least, some heterogeneity is explainable (desirable even) - because the data sources cover different care settings, different populations, different national health systems. Data capture methods, accuracy of ETL and vocabulary mapping are potential sources of less ‘explainable’ heterogeneity. An essential step in evidence generation for OHDSI studies is therefore presenting and exploring heterogeneity, and where appropriate we may combine results in a meta-analysis. From this perspective Option 1, which affords the option to compare results from each database prior to pooling / meta-analysing the results, seems preferable. Option 2 seems to make a strong assumption that there are no differences between the component data sources important enough to be worth exploring.

Combining data/results from different databases: Where between-database heterogeneity is not too great we can combine data or results from different databases to increase study power or precision. A two-stage meta-analysis combines aggregated results (e.g. effect estimate or numerator and denominator for incidence rates) from each database. This approach works with Option 1. A one-stage analysis pools patient-level data prior to analysis, and is possible only with Option 2 (using standard OHDSI analytics). Both approaches can account for heterogeneity and in most circumstances should give similar results, but 1-stage analysis is generally preferred where feasible. So Option 2 wins here out of the box, though the trade-off is the inability to assess between-database heterogeneity. I also imagine that with some tinkering in R, or using views to combine CDM tables, it should be possible to do a one-stage analysis with Option 1 as well.
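For concreteness, here is a minimal sketch of the two-stage approach: inverse-variance, fixed-effect pooling of per-database effect estimates. The numbers are made up for illustration; in practice OHDSI's EvidenceSynthesis R package implements this properly, including random-effects models:

```python
import math

# Hedged sketch of a two-stage (aggregate-level) meta-analysis:
# inverse-variance fixed-effect pooling of per-database estimates.
# The estimates below are invented purely for illustration.

estimates = {  # database -> (log hazard ratio, standard error)
    "A": (math.log(1.30), 0.15),
    "G": (math.log(1.45), 0.20),
}

# Each database is weighted by the inverse of its variance.
weights = {db: 1.0 / se**2 for db, (_, se) in estimates.items()}
pooled_log_hr = (
    sum(w * estimates[db][0] for db, w in weights.items()) / sum(weights.values())
)
pooled_se = (1.0 / sum(weights.values())) ** 0.5

print(f"pooled HR = {math.exp(pooled_log_hr):.2f} "
      f"(95% CI {math.exp(pooled_log_hr - 1.96 * pooled_se):.2f} "
      f"to {math.exp(pooled_log_hr + 1.96 * pooled_se):.2f})")
```

Note the pooled estimate necessarily lands between the two per-database estimates; what the pooling alone does not tell you is whether those estimates differed for interesting reasons, which is exactly the heterogeneity question above.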

I’d be really interested to hear views on the above, and on related questions such as:

Whether and under what circumstances might it be appropriate to create an OMOP database from two or more different data sources?

Are there examples where something similar has been done?

Does creating a virtual combined database using views seem a viable option?

Hello @dandedman,

All roads in OHDSI lead to evidence generation! The CDM Builders forum is probably the most appropriate place for your question, but you will definitely get some replies here.

One data point you didn’t address in your post and is necessary to know, do these two databases overlap in the patient population? Could one person have data in both datasets? If yes, you definitely want to combine the two datasets into one OMOP. Observational, longitudinal data assets provide the best insights when you have a full picture of a person’s healthcare journey.

Oversimplified answer:

Is there any data that can be duplicated between the two (e.g. person_source_value)? If so, either the data should be in one OMOP instance, or the ETLs must be coordinated so that the duplicate data has the same IDs in both. Coordinating the ETLs is most of the work anyway of putting two sources into one OMOP instance.
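One way that ID coordination could look, assuming person_source_value is a stable cross-system identifier (a strong assumption; real record linkage is usually harder than a dictionary lookup):

```python
# Sketch of coordinating person_id assignment across two ETLs via a
# shared registry. The registry hands out exactly one person_id per
# person_source_value, so the same person gets the same id in both
# OMOP instances. The "nhs-XXX" values are invented for illustration.

class SharedIdRegistry:
    def __init__(self):
        self._ids = {}

    def person_id(self, person_source_value: str) -> int:
        # Reuse the existing id if we have seen this source value before;
        # otherwise assign the next sequential id.
        return self._ids.setdefault(person_source_value, len(self._ids) + 1)

registry = SharedIdRegistry()

# The ETLs for database A and database G both consult the same registry:
etl_a = [registry.person_id(v) for v in ("nhs-001", "nhs-002")]
etl_g = [registry.person_id(v) for v in ("nhs-002", "nhs-003")]

print(etl_a)  # [1, 2]
print(etl_g)  # [2, 3] -- "nhs-002" keeps the same id in both instances
```

In a real pipeline the registry would be a persistent key table rather than an in-memory dict, but the principle is the same.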

Speed is a different issue. There is no way of knowing without testing: different database flavors, as well as the hardware and the database settings, change query speeds drastically. Worry less about this, unless you hit a roadblock, and instead worry about data integrity.

Thanks @MPhilofsky
Interesting that both you and @Mark asked about duplication, and the last thing I did before posting was remove a sentence about that…
In the UK all individuals are registered with a single provider (a primary care or general practice) at any one time. When an individual moves between providers we generally cannot track that, so they are represented in the database as two different individuals with non-overlapping observation periods. There is a second overlap situation where a number of providers contribute data to both databases. This occurs when a provider changes software vendor and migrates historic data into the new system. In this situation, data for all patients registered with that provider up to the time of migration is represented in both databases. Since we know which providers are affected and the date of migration, for studies requiring both data sources we typically remove data for those providers from the database with the shorter observation period. I agree it is an important point, and I expect we would deduplicate in this way for both Option 1 and Option 2 (hence why I removed the comment from the original post).
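The deduplication rule described above could be sketched as follows. Provider names, dates, and the dictionary layout are purely illustrative, not real CDM structures:

```python
from datetime import date

# provider -> {database: (observation_start, observation_end)}
# Invented example: practice_42 migrated from vendor A's system to
# vendor G's in mid-2018, so its data appears in both databases.
coverage = {
    "practice_42": {"A": (date(2005, 1, 1), date(2018, 6, 1)),    # up to migration
                    "G": (date(2005, 1, 1), date(2023, 12, 31))},
    "practice_07": {"A": (date(2010, 1, 1), date(2023, 12, 31))},  # no overlap
}

def database_to_keep(spans: dict) -> str:
    """Keep the provider's data in the database with the longer
    observation period; the copy in the other database is dropped."""
    return max(spans, key=lambda db: spans[db][1] - spans[db][0])

decisions = {provider: database_to_keep(spans)
             for provider, spans in coverage.items()}
print(decisions)  # {'practice_42': 'G', 'practice_07': 'A'}
```

For providers present in only one database the rule degenerates to "keep what you have", so the same pass can run over all providers.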


This question you are bringing up is important and keeps popping up all the time. Of course in most cases, databases behave like fractals: when you say “combine two databases” you make it sound as though each database is a clean and homogeneous thing, when in fact it is itself also a combination of databases (or data feeds). In your case, it’s the feeds from the providers, and each provider is in a way his or her own healthcare setting with his or her own population, data capture habits, accuracy and the like.

Of course there is probably greater variability if the providers use different EHR systems. Which gets us to the core of the problem: A lot of these differences are not fundamental differences of what happens, but how things are captured. Apparently (and we know that’s the case) the data capture is far from that perfect Closed World model: Everything that happens is recorded, and what is not recorded did not happen. I.e. perfect sensitivity and specificity of all clinical events. We know it is not true, but we base all our methods and statistics on it as if it were (except when we phenotype).

Why am I saying all this: When you can combine I think you should combine. The differences may be good for some sensitivity type analysis, but mostly they are artifacts of the electronic data capture systems. The larger your data the greater the chance that you can dilute these artifacts.

Thanks @Christian_Reich. Very interesting. There seem to be at least two concepts in play here: systematic vs random variability, and measurement error.
You make a good point about different providers and the inherent multilevel nature of many databases. At the level of individual primary care practices, we can consider these as randomly selected from a universe of all practices, and combining in a single database makes sense, and ignoring the multi-level structure is usually OK. Which is lucky because it quickly gets complicated otherwise.
As you suggest above, the real problem for observational databases is measurement error (and other assorted biases). In that scenario combining data gets you a more precise, but still biased, answer. Sensitivity analyses can be very important for assessing direction and extent of bias.
Going back to the databases built from different EHR software, it seems there is at least some scope for systematic differences which could impact on measurement error. They may differ in how much the user interface encourages use of coded entries (which we collect) rather than free text (which we do not collect), or in the use of proprietary codes which are harder to map. And the way we process each data feed and apply data cleaning or other quality assurance metrics is not identical. The truth is we don’t know whether and when these differences might be important. Wouldn’t it therefore be prudent to retain the ability to explore differences in the databases prior to combining them?

For Option 2, where a single pooled database is created, understanding the heterogeneity between the data sources is crucial: you might need to carefully consider and control for potential differences in data models, coding practices, or patient populations between the two sources. Option 2 is also likely to require more extensive standardization effort to ensure consistency in data representation, since you’ll need to harmonize vocabularies, units, and other data elements across the different sources.