Distinguishing provenance

MPhilofsky · August 3, 2018, 8:30pm

Hi Friends!

Seems I always have questions about distinguishing provenance.

We are ETLing 2 EHR systems into one OMOP CDM. Yes, I know why we don’t want to do that, but that’s what we are doing. So, how do others distinguish one source from the other? And, more importantly, how should we do it so the OHDSI tools can distinguish the difference?

parisni · August 3, 2018, 10:54pm

Can you elaborate on this? I wonder why merging EHRs into one instance of OMOP would be unnatural.

Right now we are not distinguishing provenance in our OMOP instance, which has 2 EHRs as sources. If we had to, it would be at the fact level (i.e., for each piece of information), because some admissions can come from EHR1 and others from EHR2.

I guess the information is valuable since practices and data quality might differ from one EHR to the other, and having it would help in capturing and explaining analysis gaps.

Christian_Reich · August 4, 2018, 2:12pm

What @MPhilofsky is referring to is that when we create an OMOP database, we don’t pool different sources. We keep them separate. Unless we pool, and then we treat them as one source. To pool, and then to keep it separate at the same time, has been discussed and dismissed before. But it keeps coming back.

@MPhilofsky: Make a proposal to V6. Figure out a way to flag each record without creating a gigantic bureaucracy. Also, try figuring out what you do if you merge data from sources, like DRUG_ERA or VISIT records for the same patient.

If you feel like that is too much of it (which would be my feeling) you can still use VISIT and/or CARE_SITE and then keep an external reference table which tells you which one is which. Seems easier to me.
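
A minimal sketch of that external reference table idea, using an in-memory SQLite database. The `care_site_source` table and its `source_system` column are hypothetical and sit outside the CDM; only `care_site_id` on the visit record is taken from the CDM itself, and the rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Trimmed-down stand-in for the CDM VISIT_OCCURRENCE table
    CREATE TABLE visit_occurrence (
        visit_occurrence_id INTEGER PRIMARY KEY,
        person_id INTEGER,
        care_site_id INTEGER
    );
    -- Hypothetical lookup kept outside the CDM: which EHR feeds each care site
    CREATE TABLE care_site_source (
        care_site_id INTEGER PRIMARY KEY,
        source_system TEXT
    );
    INSERT INTO visit_occurrence VALUES (1, 100, 10), (2, 100, 20), (3, 101, 10);
    INSERT INTO care_site_source VALUES (10, 'EHR1'), (20, 'EHR2');
""")


def visits_by_source(connection):
    """Count visits per source system by joining through care_site_id."""
    rows = connection.execute("""
        SELECT s.source_system, COUNT(*)
        FROM visit_occurrence v
        JOIN care_site_source s ON s.care_site_id = v.care_site_id
        GROUP BY s.source_system
        ORDER BY s.source_system
    """).fetchall()
    return dict(rows)


visit_counts = visits_by_source(conn)
```

The CDM tables stay unchanged here; the source label lives only in the side table, so it can be dropped or revised without touching the ETL output.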

Andrew · December 3, 2018, 1:33pm

@Christian_Reich I am surprised that I am learning of this convention for the first time. Does this only apply when there is presumed to be little overlap in patients between 2 sources? If an institution has lots of data on the same patients from more than one EHR or other source, what is the rationale for not matching and pooling? The choice not to pool would be a commitment to known and avoidable incompleteness, right?

The means for recording provenance within institutions that have more than one data source, and across institutions in a network, is being advanced in the Metadata and Annotations WG. There, one of the use cases is data quality/benchmarking across institutions. Another case, as you know, where pooling happens and provenance needs to be distinguished is the oncology extension, where tumor registry and EHR data collide.

Christian_Reich · December 3, 2018, 1:44pm

@Andrew:

Hang on. Pooling is not recommended if the sources essentially have nothing to do with each other: Different or unlinkable patients, different data capture mechanisms, different health care settings. Pooling will not add much value and it will be harder to identify and control for artifacts in each of the sources.

What you are talking about is creating a longitudinal record for the same patient. Very different animal. Please do that. The place to keep provenance is the Type Concept. So, if a patient receives a drug as administered by the physician or nurse, or during an operation, or during a special diagnostic procedure, or reports which pills she is prescribed - all of that should go into DRUG_EXPOSURE, but the drug_type_concept_id should reflect the different origin.
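
As a rough illustration of the Type Concept idea, the sketch below groups DRUG_EXPOSURE-like records by `drug_type_concept_id`. The numeric IDs and labels in `TYPE_LABELS`, and the drug concept IDs, are placeholders for illustration only, not real OMOP concept IDs; an actual ETL would look the type concepts up in the standardized vocabulary.

```python
from collections import defaultdict
from dataclasses import dataclass

# Placeholder IDs for illustration only; these are NOT real OMOP concept IDs.
TYPE_LABELS = {
    1: "administered by clinician",
    2: "administered during procedure",
    3: "patient self-report",
}


@dataclass
class DrugExposure:
    person_id: int
    drug_concept_id: int       # placeholder values below
    drug_type_concept_id: int  # carries the provenance of the record


def group_by_origin(exposures):
    """Group exposure records by the origin implied by their type concept."""
    grouped = defaultdict(list)
    for exposure in exposures:
        label = TYPE_LABELS.get(exposure.drug_type_concept_id, "unknown origin")
        grouped[label].append(exposure)
    return dict(grouped)


exposures = [
    DrugExposure(person_id=1, drug_concept_id=500, drug_type_concept_id=1),
    DrugExposure(person_id=1, drug_concept_id=500, drug_type_concept_id=3),
    DrugExposure(person_id=2, drug_concept_id=501, drug_type_concept_id=2),
]
by_origin = group_by_origin(exposures)
```

Person 1 has the same drug recorded twice, once as administered and once as self-reported. Both rows sit in the same table and differ only in the type concept.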

Patrick_Ryan · December 3, 2018, 1:54pm

@andrew, i think the language here could be ambiguous to other newcomers in the community, so let me try to restate @Christian_Reich’s comment:

One CDM instance represents one observational database that contains a set of persons with some capture of clinical observations about those persons. Some organizations have access to multiple observational databases (ex: one could license CPRD, MarketScan, Optum, PharMetrics), and our recommendation is to maintain each of these disparate populations as separate CDMs. From time to time, someone considers instead building only 1 CDM instance that contains all data from all sources (e.g. stack all persons from CPRD, MarketScan, Optum, PharMetrics in one massive database), and then asks “where’s the field in the CDM tables that lets me preserve provenance from which source the patient came from?”. The answer: there are no such fields in the OMOP CDM, because this is not recommended behavior. Instead, we recommend that you treat each database as a separate collection of patients and a distinct vantage point on the healthcare system and associated data capture process. Rather than running 1 analysis against an amalgamated database, we suggest you run the same analysis consistently across each database, and then you can synthesize the evidence that arises from your data network.
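
A toy sketch of "same analysis, each database separately", assuming each CDM instance can be reduced to a list of person records. The database names, the `has_outcome` flag, and the records themselves are hypothetical stand-ins; a real network study would run a shared analysis package against each CDM.

```python
def outcome_proportion(persons):
    """One analysis definition: share of persons with the outcome flag set."""
    if not persons:
        return None
    return sum(1 for p in persons if p["has_outcome"]) / len(persons)


# Hypothetical stand-ins for separate CDM instances; note that person_id 1
# in one database has nothing to do with person_id 1 in the other.
databases = {
    "database_a": [
        {"person_id": 1, "has_outcome": True},
        {"person_id": 2, "has_outcome": False},
    ],
    "database_b": [
        {"person_id": 1, "has_outcome": False},
        {"person_id": 2, "has_outcome": True},
        {"person_id": 3, "has_outcome": False},
        {"person_id": 4, "has_outcome": False},
    ],
}

# Run the identical analysis once per database and keep the results apart.
results = {name: outcome_proportion(persons) for name, persons in databases.items()}
```

The per-database results stay labeled by source, so differences between databases remain visible before any synthesis step.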

But I want to separate this notion of ‘pooling populations’ from the idea that @Andrew is raising, which is ‘linking persons’. Certainly, it is reasonable (and generally expected) that a given population may have patient-level data that might come from disparate sources. Simple examples: an administrative claims system is typically made up of disparate fields of medical service claims and pharmacy dispensing claims, and may further be linked with laboratory measurement data or health risk assessments. A clinical registry may represent multiple data feeds brought together at the person-level, and registry-claims linkages (like SEER-Medicare) combine these ideas together. And if you are looking to maintain this type of provenance (e.g. ‘where did the clinical observation for this patient come from?’), then that’s the explicit intent of the _TYPE_CONCEPT_ID fields in every OMOP CDM table.

Andrew · December 3, 2018, 2:01pm

@Christian_Reich and @Patrick_Ryan
Thanks. I was, as I suspected, applying the comment about not pooling too broadly.

rimma · December 3, 2018, 3:46pm

When we built the NYC CDRN integrated OMOP instance (composed of data from six major NYC hospitals), we had to overcome CDM limitations related to this convention. One of them was choosing the most granular and up-to-date demographics data. The solution was based on preponderance conventions we established. Provenance of demographics was not preserved. @MPhilofsky, if you are interested I could pull those preponderance rules and share them with you. I’d also be interested in joining you if you plan to work on the provenance proposal for v6.

Generally, I understand the rationale for not mixing data sources with overlapping patient data. However, the claim of OMOP CDM being patient-centric should be supported with structures and conventions so that creating the most complete longitudinal patient record would be possible whether it is sourced from one or multiple/heterogeneous data sources. The NYC-CDRN and specialized cancer hospital use cases are perfect examples where integration from multiple sources creates a better and more complete patient health record.

MPhilofsky · December 4, 2018, 10:48pm

Colorado decided to keep the children’s and adult hospitals’ EHR data separate and use an MPI to link patients when needed. We did include state death data in both the children’s and adult instances of OMOP. This was easy because the type_concept_id for death distinguishes the provenance.

I’m working on another dataset that is combining children’s hospital and adult hospital data into one CDM instance. We are distinguishing the two sites by using the Care_Site table. We have had to work through many decisions regarding differing data for the same CDM TABLE.field. Another issue we have discovered is even though these two hospitals use the same EHR, they do NOT use the EHR in the same way. Workflows differ between the two institutions as well as the use of EHR tables and fields. This makes it not really the same data. But, then again, no data are the same. Only use of the CDM will tell if combining EHR data sources is more detrimental than beneficial.

@rimma I would like to take a look at the preponderance conventions. Please forward them to me. At this time I don’t plan to work on the provenance proposal. Using the Care Site table will work for the use case.