ETL development for Vocab 4.5 and beyond

DTorok · February 26, 2015, 3:17pm

I think the rule in vocab 4.5 and beyond, that the concept domain type should guide what OMOP table the medical event is stored in is a game changer for those writing ETL code. I plan to change my approach. I am thinking of creating a ‘mash-up’ table that combines attributes from the Condition, Procedure, Drug and Observation tables and ETL’ing source data into this ‘mash-up’ table. Then writing views on top of the ‘mash-up’ to express the various OMOP tables based on the concept domain.

My question is, “is it worth it for the community to spec out this mash-up table?” As a developer
I typically loath it when the specification tries to force an implementation.

However I see and upside if there is an agreed upon ‘mash-up’ table:

ETL definition will be simpler because instead of documenting how an
event from the source diagnosis table can be mapped into all the
possible tables based on the domain, it will only have to define the
rules into the ‘mash-up’ table.
Greater consistency in the ETL implementation will likely lead to
more consistent results in the OMOP tables.
We can develop one set of views to convert the mash-up into the OMOP
representation.
I’ve seen two rule base systems that attempt to make it easier to develop ETLs into OMOP, that will now not function. Instead of defining a rule from the source diagnostic table to the Condition Occurrence table, they will now need to accept rules into multiple tables and then to pick which rule to use based upon the concept domain.

Downside:

Another community documentation effort.
Less direct understanding of source to final destination tables in
OMOP CDM
Standard will be too late for those already developing ETLs based on
vocab version 4.5 and beyond.

I doubt that I am the only one to think about implementing ETL’s in this manner. Looking for comments on this approach if someone has already tried it, and if successful, sharing of their ‘mash-up’ table definition, maybe as a de-facto standard.

DTorok · September 26, 2016, 2:11pm

I have a proposed mash up table.unified_cdm5_table.xlsx (34.9 KB)

Christian_Reich · September 26, 2016, 3:06pm

@DTorok

Wait. You want to do that on the vcabulary side? I thought you were talking about an ETL approach.

Mark_Danese · September 26, 2016, 4:31pm

@DTorok-- very good idea, and very timely. Perhaps others will put their approaches out there as well. We all discussed this at the OHSDSI meeting a year ago, and we promised to put our approach out there. I know I have been delinquent about doing that.

Our solution was to create our own “generalized” CDM to use as a stand-alone for our own research with our software, as well as a starting place from which to go to OMOP or mini-Sentinel (or PCORnet). We are about 80% of the way done with the draft manuscript. It will go on arxiv.org and I will post a link once it is done.

jenniferduryea · September 28, 2016, 4:02pm

@DTorok - I’m all for greater consistency across ETLs. And I’m interested in what you are proposing. However, I downloaded your xlsx file and I am a little confused as to how to implement this or how this is supposed to be used. Do you create a tab for each source column you are ETL’ing in? For example, if you have a column in the source data called “source_code”, you would create a tab in the xlsx document that specifies the type_concept_id and any domain-specific fields for each table “source_code” could map to (similar to what is under “CDMv5” tab)? How is that different than adding that information to the ETL spec (under multiple tables)? Or am I totally missing the point? Thanks for starting this conversation!

Sigfried_Gold · September 28, 2016, 5:11pm

Hi Don,

I’m also curious and confused about this but haven’t been involved in ETL
or CDM discussions so that’s probably the cause of my confusion. But I’m
wondering, if you’re talking about ETLing straight into the mashup table,
does that mean that you discard domain-specific values (e.g., days supply,
lab results)? Based on your excel file, maybe you’re talking about ETLing
analysis results rather than source data?

Thanks!
Sigfried

DTorok · September 28, 2016, 5:53pm

The mash-up table columns are in column Y. There is a column days supply.
So, this is about ETLing source data. Currently ETL has to account for the
fact that a procedure, or condition codes can map to multiple domains. For
example, a condition may map into the condition,observation or measurement
domains. One way to handle this is to have three different insert
statements, one for each possible domain. All I am suggesting in the
mash-up table is a single table that contains all the columns from the
condition, observation and measurement tables (note a lot of those columns
will overlap), then you can use a single select statement to populate the
mashup table. The domain of the ‘Maps to’ concept goes into the column
event_concept_id. To populate the actual OMOP tables you select out of
the mash-up table based upon the
event_concept_id.

Sigfried_Gold · September 28, 2016, 6:20pm

Sorry – this is embarrassing. I was looking at the wrong excel file. I see
what you mean now.

ericaVoss · September 29, 2016, 6:18pm

@DTorok,

@schuemie took at stab at implementing this idea in RIAH but it is not well documented yet.

We would welcome feedback on how it was implemented!

DTorok · October 25, 2016, 9:40pm

We are working on an ETL into OMOP CDMv5. In an earlier post I suggested the ETL might be easier if we created one big mash up table that held all the medical events. Many columns such as the event concept id, event source code, event source concept id, visit occurrence … overlap and can have a common name in the mash-up table. The hoped for gain is we do not have to worry about the target domain we selecting from the source data. We can write one statement against the condition or procedure codes in the source data and have them go into this single table instead of having to code for conditions that map to conditions, then conditions that map to procedures, then condition that map to observations … After the mash-up table is filled we can write views that represent the OMOP CDMv5 event tables (condition, procedure, drug …) keying off the target domain.
In writing the view it is very apparent how messy the Type Concept (condition, procedure, observation type …) are. When a condition maps to a procedure there are 99 different condition_type_concept_id and 93 different procedure type concept id’s but a lot do not line up, for example there are 21 variations of Inpatient Detail where the concept_class_id = ‘Condition Type’, but only only two Inpatient Detail variants where the concept_class_id = ‘Procedure Type’. Having to map from Condition Type to Procedure Type concept id is not specific to this ETL approach, but when trying to account for all the possible mapping of Condition Type into Procedure Type you see how messy the Type Concepts have become.

Chris_Knoll · October 26, 2016, 12:48am

Hey, @DTorok:

I’ve been in that situation of tryign to figure out what concept type to put into the target table based on the source table that it came from. I haven’t done it with a mashup table, but what if you added columns to your mashup table that is {domain}-concept-type such that when you create your views you look at the right concept_type column for the specific target domain? All you’ll need to do is specify what concept type to use based on what target domain that event source should map to.

Does that work?

PS: i agree there’s tons of those type concept IDs but based on the type of source data, you should only need 1 or two type concepts per domain (maybe you get some of your data from a EHR record and another set of data from parsing out notes for example). But the variety of type concepts is just reflecting the cdm wanting to capture as many different source types as possible, and shouldn’t mean that you have to account for all possible types of concepts.

DTorok · October 26, 2016, 3:28pm

If I understand your suggestion, I would need a case statement with
something like
CASE target_domain
WHEN Procedure THEN
WHEN Measurement THEN
WHEN Observation THEN

Which I guess would work, but will require hard coding the mapping from one
event type domain to another. My solution is to create a table that maps
from the default domain type concept to the target domain type concept.
Then the view on top of mash-up table can simple select from this map. It
was in defining the mapping that the inconsistency between types stood
out. Especially for the mapping types for claims records. One solution is
to remove the implied relationship between event type and domain. The
mapping type is suppose to “delineate the origin of the source
information”, so ‘Inpatient Header first position’ or ‘Patient Reported’
would be relevant to a procedure, condition, measurement, drug, or
observation. No reason for have ‘Inpatient Header first position’ for each
domain. This problem was really created with vocab 4.5 when the vocabulary
determined the target table in the CDM. We just never bothered to clean up
the event type concepts.

Chris_Knoll · October 26, 2016, 7:26pm

when you populate the mashup table, yes, you’d need a case statement there, to populate the right X_concept_type_id column with the right concept ID. But that’s just when you etl’d the data into the mashup form. For your views, yo know that when you create the view for PROCEDURE_OCCURRENCE, just read from the procedure_type_concept_id field. Am I understanding that correctly?

DTorok · October 26, 2016, 7:49pm

Your scheme would work, but I did not want to have the big case statement,
nor hard code the mapping from the default domain to the target domain in
the select statement. Instead we determine the event type for the default
domain in the mash-up table. For example if getting conditions from and
EHR system set the event type id in mash-up table to EHR Chief complaint or
EHR Episode, or EHR problem list entry where concept_class is Condition
Type. When defining the view, select into a table that maps the Condition
Type concept to the Type concept for the target domain determined by the
vocabulary.

Mark_Danese · December 12, 2016, 7:34pm

We are almost done and we have the model on GitHub: https://github.com/outcomesinsights/generalized_data_model – we have a few small issues to resolve on the final structure for our “details” tables. And we might be deficient in some of the measurement and drug exposure columns specific to OMOP, which could obviously be added easily.

The reason this might be useful to the OHDSI community is that it provides a way to do all of the data relocation and grouping, prior to doing the visit creation and vocabulary mapping. And if somebody is doing ETLs to both Sentinel and OMOP, it gives a nice stopping place from which they can get to both data models.

Once I get the manuscript done, I will link to it.