OHDSI Home | Forums | Wiki | Github

Incremental ETL and Vocabulary Update

(Don Torok) #1

Lots of EHR sites want to do incremental updates into their CDM’s, but there is always the problem of how to do incremental updates AND periodically update the vocabulary. The follow is intended to address how to update the CDM with a new vocabulary release without having to do a complete refresh. (Disclaimer, we use the combined event table described below in creating our ETL’s but have NOT done incremental ETLs.)

Say that your ETL, rather than inserting into Condition_Occurence, or Drug_Exposure , instead adds rows to a single wide table that has a combination of all the columns needed to populate all the medical event tables. There are a lot of columns that are similar for all the tables, like the primary key condition_occurrence_id or drug_exposure_id,… And there are columns that are unique to each table. For CDMv5.3 it takes about 65 columns to account for both the common, unique, and cost attributes.

See https://github.com/OHDSI/sandbox/tree/master/CDMv5

This event_ combined table has the following columns:

Default_domain_id – This is the domain (table) that the event will go into if there is no vocabulary mapping for the source code.

Domain_id – This is the domain (table) that the vocabulary says the record belongs in.

The CDM tables are implemented as views that select from the all-encompassing Event Combined table.

See https://github.com/OHDSI/sandbox/blob/master/CDMv5/decompose_event_combined_ddl.sql.

There is a view for each CDM event table, drug_exposure, condition_occurrence …. For example, the drug_exposure view selects rows where the domain_id = ‘Drug’.

This scheme has the advantage that after a vocabulary update you can selectively update rows in the event combined table where either the target_concept_id or the domain_id of a source code has changed with the new vocabulary. And because the views select rows based on the domain_id a row that previously mapped to an Observation, but now maps to a Condition will, via the view, now be found in the Condition_Occurrence view . This addresses the problem of having to move rows from one table to another if the domain_id of a concept changes with a new vocabulary release.

There are other problems to solve with incremental updates such as extending the length of an inpatient visits and potential performance problems with using views instead of tables, but this scheme may help those thinking about doing incremental updates.