OMOP CDM ETL Update Strategies: Incremental vs. Kill and Fill

Hi!
We are on CDM 5.3, and I just recently started supporting the ETL; I was not part of the original implementation team. We are implementing in Azure.

For the most part, the ETL is a kill and fill process in the staging tables, and definitely for all CDM schema tables. Additionally, the primary keys of most of the clinical tables use the IDENTITY(1,1) property, so all keys are re-assigned with each load. Some of the largest tables take between 8 and 60 hours to load.
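To illustrate, here is a simplified sketch of what such a table looks like (not our actual DDL; the name and column list are abbreviated examples). Because the surrogate key comes from IDENTITY(1,1), truncating and reloading restarts the sequence, so the key values change with every refresh.

    -- Minimal sketch (T-SQL), not the actual DDL from this implementation.
    -- With IDENTITY(1,1) the surrogate key is regenerated on every kill-and-fill load,
    -- so condition_occurrence_id values are not stable across refreshes.
    CREATE TABLE cdm.condition_occurrence (
        condition_occurrence_id BIGINT IDENTITY(1,1) PRIMARY KEY,
        person_id               BIGINT NOT NULL,
        condition_concept_id    INT    NOT NULL,
        condition_start_date    DATE   NOT NULL
        -- remaining CDM 5.3 columns omitted for brevity
    );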

And now, the team is considering moving to an incremental load. I was told by members of the original implementation team that the design intention is for the old database to remain available so researchers can save their cohorts. I queried the COHORT and COHORT_DEFINITION tables, and both are empty!

I hope an OMOP expert can provide feedback on my comments above, and also answer whether most OMOP implementations use kill and fill, and if so, whether they reset the primary key fields each time. Thanks in advance!


Most OMOP ETLs are, as you say, kill and fill, and given that implementation, there is no reason not to use identity columns for the primary key. I would spend some time exploring what is going on such that a load takes 60 hours; it would seem there is room for improvement.

At the University of Colorado we are using a kill and fill approach each time we refresh our data. We do not reset our PKs each time we re-instantiate our CDM. We use the Google Cloud Platform, and the FARM_FINGERPRINT function produces the primary keys for each CDM table.
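For readers unfamiliar with that approach, here is a minimal sketch of deterministic key generation in BigQuery; the staging table name and the choice of hashed source columns are illustrative assumptions, not our actual ETL code. Because FARM_FINGERPRINT of the same source key always returns the same 64-bit integer, the primary keys stay stable across refreshes.

    -- Minimal sketch (BigQuery), assuming a hypothetical staging table src.condition_stage.
    -- FARM_FINGERPRINT returns a deterministic INT64 for a given string, so the same
    -- source record gets the same condition_occurrence_id on every refresh.
    SELECT
      FARM_FINGERPRINT(CONCAT(source_system, '|', source_record_id)) AS condition_occurrence_id,
      person_id,
      condition_concept_id,
      condition_start_date
    FROM src.condition_stage;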

Incremental uploads can be messy. How do you deal with changed records? Read this article: "Incrementally Transforming Electronic Medical Records into the Observational Medical Outcomes Partnership Common Data Model: A Multidimensional Quality Assurance Approach" by Lynch et al.

Hi,

Do you know of any use cases where incremental loads have been implemented successfully? Do you know how they did it?

Thank you!!

Parsa Mirhaji at Montefiore has done it.


Hi!
Together with my colleagues in Dresden, Germany, we successfully implemented an ETL job from FHIR to OMOP CDM which can be run as a bulk load or an incremental load. Some general information about the ETL job can be found here: "An ETL-Process Design for Data Harmonization to Participate in International Research With German Real-world Data Based on FHIR and OMOP CDM" by Yuan Peng, Elisa Henke, Ines Reinecke, Michéle Zoch, Martin Sedlmayr, Franziska Bathelt :: SSRN. We are currently working on a paper which describes the incremental load in detail. If you would like more information about it now, you are welcome to contact me or my colleague Yuan Peng (OHDSI Germany — Institut für Medizinische Informatik und Biometrie — TU Dresden).


A less formal reply. You can use SQL MERGE rather than INSERT when loading CDM tables so that only new or changed information is loaded. You will need to re-calculate visit end dates, observation periods, payer plan periods, and the era tables after an update. With a little cleverness in what you record, this can in some cases be limited to the people who had changes. Using MERGE instead of INSERT also still allows you to truncate the CDM tables and effect a total refresh.
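A minimal sketch of that pattern, assuming your source rows carry a stable key per record (the table and column names below are illustrative, not a prescribed design):

    -- Minimal sketch (T-SQL MERGE), assuming stage.condition_occurrence carries a stable
    -- key per source record. Updates changed rows, inserts new ones.
    MERGE cdm.condition_occurrence AS tgt
    USING stage.condition_occurrence AS src
        ON tgt.condition_occurrence_id = src.condition_occurrence_id
    WHEN MATCHED AND (tgt.condition_concept_id <> src.condition_concept_id
                   OR tgt.condition_start_date <> src.condition_start_date) THEN
        UPDATE SET tgt.condition_concept_id = src.condition_concept_id,
                   tgt.condition_start_date = src.condition_start_date
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (condition_occurrence_id, person_id, condition_concept_id, condition_start_date)
        VALUES (src.condition_occurrence_id, src.person_id, src.condition_concept_id, src.condition_start_date);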
If you really want to push the limit, you can do your ETL into the stem table (see Rabbit-in-a-Hat), with views to implement the various CDM tables based upon the domain attribute of each record. This can potentially allow you to refresh the vocabulary and then update the concept mapping in the stem table. If a source code maps to a different domain in the new vocabulary, the domain attribute in the stem table is changed, and the view, which is based on the domain, will assign the record to the correct CDM table. However, special coding will be necessary for concepts that map to multiple concepts and for changes in the "maps to value" relationship.
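For example, a domain-routing view might look something like this (a sketch only; the stem table and column names here are assumptions, loosely following the Rabbit-in-a-Hat stem table idea):

    -- Minimal sketch: route stem records to a CDM table by their current domain.
    -- If a vocabulary refresh re-maps a source code to a new domain and the stem
    -- table's domain_id is updated, the record automatically moves to the right view.
    CREATE VIEW cdm.condition_occurrence_v AS
    SELECT
        stem_id    AS condition_occurrence_id,
        person_id,
        concept_id AS condition_concept_id,
        start_date AS condition_start_date
    FROM etl.stem_table
    WHERE domain_id = 'Condition';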
