Anyone storing multiple data sources in one CDM?

All,

We are working on establishing standards for storing CDM metadata. We were wondering if anyone is currently hosting multiple data sources in one CDM database.

If so:

  1. Can you share some details about your design?
  2. Do you include records for each of the sources in the CDM_SOURCE table?
  3. Do you utilize some method of delineation for these sources in the CDM fact tables?

Thanks,
Ajit

Tagging: @ericaVoss @Vojtech_Huser @dgambone @t_abdul_basser

@Ajit_Londhe,

We are starting to populate our first CDM database. We haven’t decided if we will combine data from different sources into one CDM or use only one source to populate a CDM instance. We are also interested in what others are doing, so please include us in the discussions.

Thanks,
Melanie

Adding @mgkahn

@Patrick_Ryan did mention to me yesterday that this is allowed - as we saw in the documentation. But I still struggle to understand why it is necessary, so I’d love to hear examples to help me understand.

@bailey or @rimma do you guys have multiple CDM_SOURCE records? I know the IMS folks like @dgambone could argue for it - but I still see it as one record per CDM.

Looking forward to people’s thoughts.

My two cents:

In terms of content: if you have multiple sources that feed information
for a given patient which is linkable across PERSON_ID and covers the same
OBSERVATION_PERIOD, then there’s likely value from an analytical
perspective in bringing those sources together, since it should provide you
a more complete lens into a person’s medical history, thereby improving
your ability to classify exposures, outcomes, and other relevant covariates.

If you have multiple sources that feed different patients, then the choice
of whether to bring together or keep separate should be driven by your
analytical use case and whether you are comfortable with pooling under the
assumption that the different sources and different patients are
effectively comparable. In my experience, most data sources I have worked
with have grossly violated that assumption (2 EHRs are never alike, nor are
2 claims databases). Therefore, for my analytical use cases - clinical
characterization of disease natural history and treatment utilization,
population-level effect estimation, patient-level prediction - I recommend
keeping different sources with different populations in different CDM
instances. Then, after an analysis is performed on each CDM, you can
evaluate whether it is appropriate to pool the results (but not the
patient-level data) for any given analysis. But if you opt to pool all
data into one instance, then you won’t be able to back out the
source-specific data unless it cleanly divides by the TYPE_CONCEPT_ID in
each domain.
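
To illustrate the last point, here is a minimal sketch of how one might check, at ETL time, whether the sources would divide cleanly by TYPE_CONCEPT_ID. The function name, source names, and concept ids are all hypothetical placeholders, not real OMOP concepts; the only idea taken from the post is that sources are separable after pooling only if no type concept is shared between them.

```python
from itertools import combinations


def type_concepts_divide_cleanly(type_concepts_by_source):
    """Return True when no type_concept_id is used by more than one source.

    type_concepts_by_source maps a source name to the set of
    type_concept_ids that its ETL writes (hypothetical input structure).
    """
    for (_, ids_a), (_, ids_b) in combinations(type_concepts_by_source.items(), 2):
        if set(ids_a) & set(ids_b):
            return False
    return True


# Placeholder ids for illustration only; these are NOT real OMOP concept ids.
ehr_and_claims = {"ehr": {1001, 1002}, "claims": {2001}}
two_ehrs = {"ehr_a": {1001}, "ehr_b": {1001, 1002}}

separable = type_concepts_divide_cleanly(ehr_and_claims)
not_separable = type_concepts_divide_cleanly(two_ehrs)
```

In this sketch the EHR-plus-claims case divides cleanly, while two EHRs that share a type concept do not, which is the situation where pooled data could not be backed out by TYPE_CONCEPT_ID alone.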

As a general principle within OHDSI network studies, we expect there may be
heterogeneity across the data network and require that we be fully
transparent, and therefore we will always report source-specific estimates
and consider if a composite summary is additionally warranted, but we won’t
only report one aggregate summary.

Great explanation, @Patrick_Ryan! We are in the beginning stages of loading the CDM, so this discussion is very timely. We receive EHR data from a children’s hospital and a university hospital. And we have data requests for the course of treatment or outcomes for one PERSON_ID that was seen at both hospitals during the same OBSERVATION_PERIOD.

My questions: Would putting the EHR data from both sources into one instance of the CDM now be better from a data ETL/management perspective? Or would merging the two into one CDM in the future not be too terrible from a time/resource point of view? Pros & cons, anyone?

Thanks!
Melanie

I somewhat agree with @ericaVoss that most sites would most likely have one row in the CDM_SOURCE table.

For consensus on METADATA, I would focus on what the majority of OMOP CDM adopters typically do. Any site is always free to have extra columns in any table if they see a need for it.

For our SAFTINet network of federally qualified health centers (FQHCs), we addressed two issues of multiple data sources:

  1. For some partners we modeled linked clinical (EHR) and claims data into a single CDM db
  2. Some partners jointly used a single CDM db to hold data from several unique, but collaborating, health centers

We added 2 fields to most tables to help track this, both to let us identify ETL errors more easily and for data use (we labeled non-OMOP fields with ‘x_’):

x_data_source_id: integer, not null. Unique numeric identifier for each data source.

x_data_source_type: varchar(20), not null. Data source identifier (EHR / CDW / Medicaid); doesn’t need a reference table (just code it).
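
As a rough sketch of how these two fields might be used, the example below adds them to a heavily simplified condition_occurrence table in an in-memory SQLite database and counts rows per source, which is one way to spot an ETL load that went wrong. The trimmed column list, the sample rows, and the rows_per_source helper are assumptions for illustration; only the two x_ field definitions come from the post above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simplified table: a real condition_occurrence has many more columns.
conn.execute("""
    CREATE TABLE condition_occurrence (
        condition_occurrence_id INTEGER PRIMARY KEY,
        person_id               INTEGER NOT NULL,
        condition_concept_id    INTEGER NOT NULL,
        x_data_source_id        INTEGER NOT NULL,
        x_data_source_type      VARCHAR(20) NOT NULL
    )
""")

# Made-up rows: person 10 appears in both an EHR and a Medicaid feed.
rows = [
    (1, 10, 0, 1, "EHR"),
    (2, 10, 0, 2, "Medicaid"),
    (3, 11, 0, 1, "EHR"),
]
conn.executemany("INSERT INTO condition_occurrence VALUES (?, ?, ?, ?, ?)", rows)


def rows_per_source(connection):
    """Count fact rows per data source (hypothetical helper)."""
    cur = connection.execute(
        "SELECT x_data_source_id, x_data_source_type, COUNT(*) "
        "FROM condition_occurrence "
        "GROUP BY x_data_source_id, x_data_source_type "
        "ORDER BY x_data_source_id"
    )
    return cur.fetchall()


counts = rows_per_source(conn)
```

The same WHERE or GROUP BY on x_data_source_id would also let an analysis be restricted to one source without touching the standard OMOP columns.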

Hi Melanie,

I think that if the populations are inherently similar (1 augments the other), you could merge the 2 into 1 CDM, but with qualifications of the fact records; location of service or type_concept_ids could help retain the ability to delineate between the two sources at a micro-level. However, if division of the datasets is truly necessary for analyses (or if the 2 populations are significantly different), I would suggest making 2 CDMs.

In the case of 1 merged CDM, I think the CDM_SOURCE table can still just have 1 record explaining the build features of the overall database.

Thanks,
Ajit

We maintain a single CDM instance of data from multiple sources, partly for logistical reasons, and partly to facilitate pooled analyses. At some level, I suspect almost all CDM instances are from multiple sources; the question isn’t whether it’s done or not, but how well we can identify them. The type_concept_ids are a big help in tagging different sources of data (billing vs ordering, admin vs prescribing, etc.). We also add a convenience site column to several of the tables to let us constrain queries to particular institutions without having to link through care_site or provider.

We’ve found that the database management is easier using a single CDM. It does take additional coding to do an analysis against just some sites or just some types of data, but most of our analyses include multiple sites, so it’s been overall a win for us. But if you anticipate most analyses covering only a subset of data, it may be better to separate them up front.

I don’t think it’d be a big deal to merge CDMs later, as long as the key spaces don’t overlap, or to split a single CDM, as long as you’ve kept enough metadata to partition on the criteria you want.
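
A minimal sketch of the key-space check mentioned above, assuming the two CDMs hold different people and that person_id is a plain integer. The helper names and the sample ids are hypothetical, and shifting by a fixed offset is just one possible way to remove an overlap, not something the thread prescribes.

```python
def overlapping_keys(keys_a, keys_b):
    """Return the set of key values present in both CDMs."""
    return set(keys_a) & set(keys_b)


def shift_keys(keys, offset):
    """Move one CDM's keys into a fresh range by adding a fixed offset."""
    return [key + offset for key in keys]


# Made-up person_id values from two separately built CDMs.
cdm_a_person_ids = [1, 2, 3]
cdm_b_person_ids = [2, 3, 4]

clash = overlapping_keys(cdm_a_person_ids, cdm_b_person_ids)

# Shift CDM B past the largest key used in CDM A, then re-check.
offset = max(cdm_a_person_ids)
shifted_b = shift_keys(cdm_b_person_ids, offset)
remaining_clash = overlapping_keys(cdm_a_person_ids, shifted_b)
```

In a real merge the same offset would have to be applied to every column that references the shifted key (for example person_id in each fact table), and patients who genuinely appear in both sources would need record linkage rather than an offset.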

Hello,

Can you please help me with the below post?

To use the standard tools, you’d want 1 CDM instance

t