
Capturing high-level information about source systems and the data they contributed to an OMOP CDM instance


OMOP newbie here. I looked around for answers to my questions in various OHDSI sources and this forum, but did not yet quite find what I was looking for. Hence the new topic.

I am trying to understand how to capture high-level information about which source systems contributed data to a given OMOP CDM instance (in a context where multiple source systems are involved). Specifically, I am wondering if it is possible to capture (inside the OMOP CDM itself) answers to the following questions that CDM users might ask:

  1. Is data from source system X (typically not OMOP itself) included in the OMOP CDM data?
  2. For which (clinical) time range has data from source system X been included in the OMOP CDM?
  3. Have all data elements from source system X been mapped to OMOP or are some missing/incomplete? Are some elements only available from a certain point in (clinical) time onwards?
  4. Are there other facts about the data coming from source system X (possibly only for a specific time interval) that I should be aware of as an OMOP CDM user? This could be, e.g., a potentially confusing change at a specific point in time due to a change in the code systems used in the source system, or data elements that have been imputed or set to dummy values in certain cases.

As a first step, it would be enough to capture this information as a single text blob for each source system, without linking to specific CDM tables or columns. If I understand correctly, the METADATA table would be the right place for such information. I see some discussion of its usage in the GitHub issue proposing the METADATA table, including topics pertinent to my question (3) (link to specific comment). However, it is not quite clear to me how to capture high-level information like the above correctly (e.g. in the CDM v5.3 documentation of METADATA, the “User Guide” column is empty). Are there guidelines/examples available on this?


Welcome to the family.

The question (as usual) is: what is the use case? What will an analyst, especially an external one who wants to use data in the OMOP CDM for an analytical purpose, do with this information?

Generally, all the detail and struggle of getting source data into the CDM lies with the ETLer. Only the ETLer has the context needed to interpret what these source systems contain, what they carry, and what is missing. The users of the system expect, as much as possible, a Closed-World system, in which we can rely on the records representing true facts and, equally if not more important, on the lack of records of a certain kind meaning such things did not happen. Only both assumptions together allow for true rate calculations (what’s the rate of XYZ in your population?).

Thanks for the welcome & for picking this up!

The kind of use case I have in mind is a CDM system that is being fed from multiple clinical sources with the goal of supporting secondary research use. The aim is to support many different research questions that may require different types of data. The target users would most likely be researchers associated with the institution running the system (but not involved with the ETL). They will be using the system for different projects (many of which may not be classical clinical studies) and the set of active projects will constantly be changing. Hence, a key question (especially for new users) is: does the system contain the data I need for my new research project X? The research users would try to answer this using the questions I mentioned in my initial post. In the end, it would help them decide if their project is feasible and, if not, what “data pieces” are missing.

As you say, the detailed data provenance story is intertwined with the ETL and can hardly be easily captured in the CDM itself. But as a newcomer, I was wondering whether some (simple) aspects of this information can be provided directly in the CDM itself, to provide some measure of “self-documentation” for the research users. Or is this typically documented entirely outside the OMOP CDM instance?


I totally understand. This keeps coming up from time to time. People want to collect provenance information. When the dismissive answer above doesn’t work, I usually try two things:

  1. Explain again that the OMOP CDM contains all data from the various systems, and they should use them all, irrespective of the source.
  2. Add some field “provenance” or “provenance_concept_id” to your fact tables, indicating the source system.

Solution 2. will not break any tools or statistical methods. It is not standard, but it doesn’t matter - it only has a meaning inside your organization. External folks wouldn’t see it, and wouldn’t know to query that field anyway. Everybody is happy. Except when a fact is in more than one system. You don’t want to create duplicates where the only difference is what’s in the provenance field. You may then create combo provenance values (e.g. “EHR+cancer registry”). Ugly, but again, that’s dirty laundry nobody will ever see from the outside.
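
A minimal sketch of option 2, including the combo provenance values, might look like the following. It uses an in-memory SQLite table that stands in for a heavily simplified fact table; the `provenance` column, the `merge_with_provenance` helper, and all sample values are hypothetical and only illustrate the idea, they are not part of the standard CDM.

```python
import sqlite3


def merge_with_provenance(records):
    """Collapse identical facts reported by several sources into one row each.

    `records` holds (person_id, concept_id, start_date, source) tuples.
    Returns one tuple per distinct fact, with the sources joined by "+".
    """
    merged = {}
    for person_id, concept_id, start_date, source in records:
        key = (person_id, concept_id, start_date)
        merged.setdefault(key, set()).add(source)
    return [key + ("+".join(sorted(sources)),)
            for key, sources in sorted(merged.items())]


# Illustrative input: the same fact arrives from two source systems.
incoming = [
    (1, 4112853, "2020-01-15", "EHR"),
    (1, 4112853, "2020-01-15", "cancer registry"),
    (2, 201826, "2019-06-01", "EHR"),
]
rows = merge_with_provenance(incoming)

conn = sqlite3.connect(":memory:")
# Simplified stand-in for a fact table; `provenance` is a local, non-standard column.
conn.execute("""
    CREATE TABLE condition_occurrence (
        person_id            INTEGER,
        condition_concept_id INTEGER,
        condition_start_date TEXT,
        provenance           TEXT
    )
""")
conn.executemany("INSERT INTO condition_occurrence VALUES (?, ?, ?, ?)", rows)
row_count = conn.execute("SELECT COUNT(*) FROM condition_occurrence").fetchone()[0]
```

Here the duplicated fact ends up as a single row whose provenance reads "EHR+cancer registry", so counts and rates are not inflated by the extra column.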


Other than now having 80’s songs stuck in my head…
I have expanded several of the OMOP tables for my ETL process. No one outside of the ETL team (big size of 2) even knows that they exist and it works great.

@Christian_Reich Thanks for the tips! - I will put on my pragmatic glasses and see what could work.

For my CDM-understanding: What kinds of information are meant to be stored in the METADATA table? In the GitHub thread related to the creation of that table, one comment (link in my initial post) mentions storing quite specific information related to changes in the ingested data, e.g. “Loss of access to Social Security Administration Death Master File” or “ICD9CM to ICD10CM migration” - hence my initial idea that the info I was looking at could also go there. Is the difference that such examples of METADATA information are not directly related to the ETL/Provenance, but rather to (potentially surprising) “internal” facts about the data in the CDM?
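
To make the question concrete, here is a rough sketch of how such entries might be recorded and queried. It assumes the METADATA columns as I read them in the CDM v5.3 specification (please check against your CDM version), uses 0 as a placeholder for both concept IDs, and the free-text notes, the dates, and the `notes_in_effect` helper are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column names follow my reading of the CDM v5.3 METADATA table (an assumption).
conn.execute("""
    CREATE TABLE metadata (
        metadata_concept_id      INTEGER NOT NULL,
        metadata_type_concept_id INTEGER NOT NULL,
        name                     TEXT    NOT NULL,
        value_as_string          TEXT,
        value_as_concept_id      INTEGER,
        metadata_date            TEXT,
        metadata_datetime        TEXT
    )
""")

# Names are taken from the GitHub comment; notes and dates are illustrative only.
notes = [
    ("ICD9CM to ICD10CM migration",
     "Diagnosis codes from source system X change code system on this date.",
     "2015-10-01"),
    ("Loss of access to Social Security Administration Death Master File",
     "Death records may be incomplete after this date.",
     "2013-01-01"),
]
conn.executemany(
    "INSERT INTO metadata (metadata_concept_id, metadata_type_concept_id, "
    "name, value_as_string, metadata_date) VALUES (0, 0, ?, ?, ?)",
    notes,
)


def notes_in_effect(connection, as_of_date):
    """Return (name, metadata_date) for notes dated on or before `as_of_date`.

    Dates are ISO-8601 strings, so plain string comparison orders them correctly.
    """
    cursor = connection.execute(
        "SELECT name, metadata_date FROM metadata "
        "WHERE metadata_date <= ? ORDER BY metadata_date",
        (as_of_date,),
    )
    return cursor.fetchall()
```

A research user could then call `notes_in_effect(conn, "2014-06-30")` to see which documented changes already apply to data up to that date.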