
OMOP CDM ETL Update Strategies: Incremental vs. Kill and Fill

We are on CDM 5.3, and I only recently started supporting the ETL; I was not part of the original implementation team. We are implementing in Azure.

For the most part, the ETL is a kill-and-fill process for the staging tables, and definitely for all CDM schema tables. Additionally, the primary keys of most of the clinical tables use the IDENTITY(1,1) property, so all keys are re-assigned with each load. Some of the largest tables take between 8 and 60 hours to load.
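To make the consequence of that design concrete, here is a minimal Python sketch (the MRN values are invented for illustration) of why IDENTITY-style surrogate keys are unstable across kill-and-fill loads: keys are handed out in arrival order, so the same source record can land on a different key after each refresh.

```python
# Hypothetical sketch: surrogate keys assigned in arrival order,
# the way IDENTITY(1,1) behaves after a truncate-and-reload.

def kill_and_fill(source_rows):
    """Assign surrogate keys 1..N in the order rows arrive."""
    return {row: key for key, row in enumerate(source_rows, start=1)}

load_1 = kill_and_fill(["MRN-001", "MRN-002", "MRN-003"])
# A new patient appears and the extract order shifts on the next refresh:
load_2 = kill_and_fill(["MRN-000", "MRN-001", "MRN-002", "MRN-003"])

print(load_1["MRN-002"])  # 2
print(load_2["MRN-002"])  # 3 -- same patient, different person_id
```

This is exactly why anything saved against the old keys (cohorts included) stops lining up after a refresh.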

Now the team is considering an incremental load. Members of the original implementation team told me the design intention is for the old database to remain available so researchers can keep using their saved cohorts. But I ran a query against the COHORT and COHORT_DEFINITION tables, and both are empty!

I hope an OMOP expert can provide feedback on the situation above, and also answer whether most OMOP implementations use kill and fill and, if so, whether they reset the primary key fields each time. Thanks in advance!

Most OMOP ETLs are, as you say, kill and fill, and given that implementation there is no reason not to use identity columns for the primary key. I would spend some time exploring why it takes 60 hours per load; it would seem there is room for improvement.

At the University of Colorado we use a kill-and-fill approach each time we refresh our data, but we do not reset our PKs each time we reinstantiate our CDM. We use the Google Cloud Platform, and the farm fingerprint function produces the primary keys for each CDM table.
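The idea behind fingerprint keys is that the key is a deterministic hash of the record's business keys, so the same source record gets the same surrogate key on every refresh, even under kill and fill. A rough Python sketch of the concept (this uses SHA-256 rather than BigQuery's actual FarmHash algorithm, and the business-key values are made up; the point is only determinism):

```python
import hashlib

def fingerprint_key(*business_keys):
    """Deterministic 63-bit surrogate key from a record's business keys.
    SHA-256 stands in for BigQuery's FARM_FINGERPRINT here: a different
    algorithm, but the same idea -- same input, same key, every load."""
    payload = "|".join(map(str, business_keys)).encode()
    digest = hashlib.sha256(payload).digest()
    # Mask to 63 bits so the key fits a signed BIGINT column.
    return int.from_bytes(digest[:8], "big") & 0x7FFF_FFFF_FFFF_FFFF

# Same business key -> same person_id on every refresh:
k1 = fingerprint_key("MRN-001", "2024-01-01")
k2 = fingerprint_key("MRN-001", "2024-01-01")
print(k1 == k2)  # True
```

With stable keys, saved cohorts and other downstream references survive a full reload.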

Incremental loads can be messy: how do you deal with changed records? Read this article: "Incrementally Transforming Electronic Medical Records into the Observational Medical Outcomes Partnership Common Data Model: A Multidimensional Quality Assurance Approach" by Lynch et al.
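The changed-record problem usually comes down to classifying each incoming row as an insert, an update, or a delete relative to what is already in the CDM. One common approach (a generic sketch, not the method from the Lynch et al. paper; the column names are hypothetical) is to hash the non-key columns and compare:

```python
import hashlib

def row_hash(row: dict) -> str:
    """Hash all non-key columns so a changed record can be detected."""
    payload = "|".join(f"{k}={row[k]}" for k in sorted(row) if k != "person_id")
    return hashlib.md5(payload.encode()).hexdigest()

def diff(current: dict, incoming: dict):
    """Classify incoming rows (keyed by person_id) against the current load."""
    inserts = [k for k in incoming if k not in current]
    deletes = [k for k in current if k not in incoming]
    updates = [k for k in incoming
               if k in current and row_hash(incoming[k]) != row_hash(current[k])]
    return inserts, updates, deletes
```

The messy part is everything the hash comparison does not answer: late-arriving corrections, records that move between domains, and whether a "delete" in the source should really delete downstream, which is where a QA approach like the one in the paper comes in.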