OHDSI Home | Forums | Wiki | Github

Extending the CDM with additional source columns vs linking back to source tables


Usually, there is a situation where one may need additional columns in the CDM that are not part of the current standard. This is particularly important when customers are used to data that may not be part of the standard. Is there any guidance regarding best practices to do so?

Given that any source data (in our case Clarity) contains possibly hundreds of columns in the source tables, How is the community dealing with this? Are we bringing as many columns as possible or are we selecting a subset?

The other option is linking back to source tables, which I call a reversed ETL. I believe this is a considerable effort. Had anyone has some experience with this?


Could you provide some examples. Typically we would add columns to the OMOP table. If you wanted to reference back to the source data you can add columns to hold the source table and key. Or there are some specific columns in the Clarity that are relevant to the subject of the OMOP table then add those columns. However, beware of ‘customers are used to data not in OMOP’, the principal goal of OMOP is a standard common schema and making OMOP look like the source data will backfire in the end.

1 Like

Hi Don,

Thank you for your answer. I like the idea of adding columns to hold the source table and key. What about storing the primaries keys from the source table and not adding any columns?
I think this will somehow preserve the CDM without making it look like the source, but enabling a straightforward query to retrieve what is needed. Also I think this may be a very interesting case that may become part of the standard, since it enables complete traceability without a reverse ETL. Thoughts?

Again, without examples of what tables/attributes you are talking about it is difficult to give a specific answer. I assume your suggestion is to use the primary key from a Clarity table as the primary key in OMOP. This may work for Person where you might expect a 1 to 1 correspondence. But will fail in the medical event tables because there will be source concepts that map to more than OMOP concept.
ICD10CM S31.629A ‘Laceration with foreign body of abdominal wall, unspecified quadrant with penetration into peritoneal cavity, initial encounter’
maps to
target_concept_id: 4205673 ‘Intra-abdominal foreign body’
target_concept_id: 4051140 ‘Open wound of anterior abdominal wall’
So you have to create 2 records in Condition Occurrence, but only have one PK from source data.
Other problem, as you stated Clarity has hundreds of tables, so you will probably need to get conditions or procedures from a number of Clarity tables. PKs in these source tables likely to overlap. I think idea of using PK from source tables as PK in OMOP tables will fail.

Ouch! The standard is a standard is a standard. If you add columns you cannot have standardized queries. You cannot have tools. You cannot have network studies. And there is no need: If we need to cover something that is not covered we add. As @DTorok said: What is it you are missing?

I have always thought that adding a column to a table is allowed and that it will not cause any of the OHDSI tools to break. For example we might add columns to hold the source table name and PK. Why is this a ‘standards’ problem?

Don - agree with you that adding column does not break the model. It just any query that uses these extra columns would naturally not work across multiple organizations - but not users of OMOP CDM use it for the purpose of network studies. ATLAS would not also be able to take advantage of these extra fields. So, people just need to be mindful of the limitation of using fields that are not a part of the OMOP CDM standard but there are some legit use cases.

In Jose’s case, they are adding some extra columns that carry extra info - and I will let Jose share this information - and those are being used by internal teams, including for traceability back to the original source.

and @jposada - Christian has raised a good point through. The OMOP CDM is very extensive as-is - would be very interesting to dive deeper into what fields researchers want to add, understand why and think whether there is a place for some of them in OMOP CDM already.

I wonder if there isn’t a bit of semantic dissonance going on here. CDM stands for Common Data Model, emphasis on “common”. You can’t have individual changes to a common data model. Then it’s no longer common.

On the other hand, your implementation of the CDM, (i.e. your database), can have valid reasons for having extra fields, but these are not part of the common model even if they’re in the database.

The source_value and source_concept_id fields in all of the OMOP tables do add a measure of look back to the source, and they are already part of the CDM.

We have added additional fields to our tables to help us validate our ETL process, but these have no research-related value. I’d like to see examples of where storing the primary key from the source will be useful.

Thank you all for taking the time to answer this post. Our need is derived for internal consumption.
I am going to try to answer the questions to elucidate more on our need.

This is perhaps one of the more challenging aspects.
We may know what are we missing as of today, but as research questions will evolve we do not know what we may be missing to fulfill needs from our internal researchers.

That was something I was anticipating. I think this could be “solved” by giving preference to the PK of some tables?

This is related to my first answer. We served a very heterogeneous population of researchers. We are currently engaging more and more with them to better understand their needs.
As a result, if we see that we can contribute and propose additions we will do so. @Christian_Reich

I do not have many examples. But one that comes to mind is related to clinical notes.
We are only recording one field: note_datetime. This can effectively be the sign off date.
However, if you want to run a study where you are considering the time where the note was first entered in the system before actual signing off you may need that extra piece of information.

And again your answers are very helpful and I really appreciate them. I hope this post also helps other sites like us with similar questions or needs

I’ve seen two simple ways to successfully handle local extensions.

First, prefixing local columns, e.g. stanford_target_concept_id makes it easy to identify queries that use local extensions and also avoids potential future conflicts: OHDSI might take the successful use of a proposed column and adopt it. But with this approach, you should rebuild your tables on release – if the OMOP CDM adds a column, it may break lazy queries and scripts that assume position with a wildcard selection. As the CDM evolves and adds colums, this can introduce pernicious bugs.

The second, safer option, especially if you have lots of columns, is to make a separate facet table, one that has 1 to 0/1 correspondence, by repeating primary key of base table. Prefix the facet table, e.g. stanford_person. Then add columns as one wishes. It’s pretty trivial to add the extra join.

There are other more complex solutions. For example OHDSI could add a JSON column to each table for “local” extensions, then leave these completely undefined. There’s also junk drawer options (shudders).

Thank you very much for the suggestions @cce

If we add any new columns to existing tables, will the OMOP application functionality break?

No. Only thing that will break tools is if you alter existing columns by renaming, or changing their type.

A follow up question: would changing the order of existing columns (but not their names) make OHDSI tools unusable? If I am adding a column between two existing columns, can I still analyze existing columns or should I add all new columns to the end of the table?

Completely irrelevant. You can scatter them randomly. As long as their names, and their fields and types stay the same everything will work.

1 Like

I encountered a similar problem today. I have some extra tables that define devices, which cannot be stored in DEVICE_EXPOSURE (because the table is “too common”). Each device could have multiple measurements linked to it. I wanted to store them in MEASUREMENT, but there is no direct link between our custom tables as well as the MEASUREMENT table.

What I didn’t have on my radar were the measurement_event_id as well as the meas_event_field_concept_id columns in MEASUREMENT. It’s the same for OBSERVATION.
My idea would be to generate some custom concepts (pk>2billion, …, domain_id = ‘Metadata’, vocabulary_id = ‘CDM’, concept_class_id = ‘Table’) and link the PK fields from omop.mycustomtable – …which is in line with the official documentation.

Does anyone have experience with this approach?

You can link records using fact_relationship table