
OMOP CDM refreshes and versioning

(Gregory Klebanov) #1

I would like to share, and hear others' ideas and opinions on, best practices for ongoing refreshes and versioning of OMOP CDM databases. A refresh affects any analysis that is currently in flight, as well as any statistics that have already been generated, so good practices here should help minimize the impact on active studies.

I see that the following has emerged as a pattern:

  1. Full refresh (quarterly / bi-annual): includes new raw data, potentially raw schema changes, ETL script changes, OMOP CDM schema upgrades, and OMOP vocabulary updates. A new OMOP CDM instance is created.

  2. Light refresh (monthly): includes only updated OMOP vocabularies and data updates, with no raw data schema changes. The refresh is performed in the existing schema.

Also, the pattern is to keep at least two instances of the data: the latest version and the one before it. For example, Q2 2018 and Q1 2018.
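As a rough illustration of the "latest and minus one" retention rule above, here is a minimal Python sketch. The `CdmInstance` class, the schema names, and the `split_by_retention` helper are all hypothetical names made up for this example; it only shows the selection logic, not how instances would actually be dropped or archived.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CdmInstance:
    """Hypothetical description of one CDM build (not an OHDSI structure)."""
    schema_name: str
    release_date: date


def split_by_retention(instances, keep=2):
    """Return (kept, retired): the `keep` newest builds and everything older."""
    ordered = sorted(instances, key=lambda i: i.release_date, reverse=True)
    return ordered[:keep], ordered[keep:]


if __name__ == "__main__":
    builds = [
        CdmInstance("cdm_2017q4", date(2017, 12, 31)),
        CdmInstance("cdm_2018q1", date(2018, 3, 31)),
        CdmInstance("cdm_2018q2", date(2018, 6, 30)),
    ]
    kept, retired = split_by_retention(builds)
    print("keep:", [b.schema_name for b in kept])
    print("retire:", [b.schema_name for b in retired])
```

In practice a site would probably also check that no active study still points at a build before retiring it.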

Please let me know how you tackle this problem in your organization.

@Christian_Reich, @mvanzandt - would we consider this a THEMIS topic?

(Steve Patterson) #2

@gregk - Did this perhaps get answered privately or in another thread? I’m interested in what others are doing with refreshes/versioning.


(Ajit Londhe) #3

@rtmill and I were just speaking about this during the DQ tutorial. We feel there should be a standard OHDSI way to tag CDM versions, rather than relying on each site to handle this with its own method. If we have a standard versioning/lineage model, we can more easily build features in Achilles or Atlas to compare source version n-1 against source version n.

At Janssen, we have a load_id field that is populated with a unique id for every new database we build. This load_id corresponds to a row in the load table, which has foreign keys to a vendor table and a vendor_schema table.

This approach allows us to link versions of a specific vendor’s specific database product together. Additionally, it helps us with our study protocols and manuscripts, as we always know exactly which CDM build(s) were used to produce a given result.
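To make the load_id idea concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It is not the actual Janssen schema: the table is named `cdm_load` here, and every column name and helper function (`create_lineage_db`, `register_load`, `previous_load`) is an assumption for illustration. The only parts taken from the description above are that each build gets a unique load_id and that the load row has foreign keys to a vendor table and a vendor_schema table.

```python
import sqlite3

# Hypothetical DDL: only the load -> vendor / vendor_schema foreign keys
# come from the post; all column names are assumptions.
DDL = """
CREATE TABLE vendor (
    vendor_id   INTEGER PRIMARY KEY,
    vendor_name TEXT NOT NULL UNIQUE
);
CREATE TABLE vendor_schema (
    vendor_schema_id INTEGER PRIMARY KEY,
    vendor_id        INTEGER NOT NULL REFERENCES vendor(vendor_id),
    schema_name      TEXT NOT NULL
);
CREATE TABLE cdm_load (
    load_id          INTEGER PRIMARY KEY,
    vendor_id        INTEGER NOT NULL REFERENCES vendor(vendor_id),
    vendor_schema_id INTEGER NOT NULL REFERENCES vendor_schema(vendor_schema_id),
    load_date        TEXT NOT NULL
);
"""


def create_lineage_db():
    """Create an in-memory database with foreign key enforcement enabled."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(DDL)
    return conn


def register_load(conn, vendor_id, vendor_schema_id, load_date):
    """Insert one row per CDM build and return its new unique load_id."""
    cur = conn.execute(
        "INSERT INTO cdm_load (vendor_id, vendor_schema_id, load_date) "
        "VALUES (?, ?, ?)",
        (vendor_id, vendor_schema_id, load_date),
    )
    return cur.lastrowid


def previous_load(conn, load_id):
    """Return the load_id of the prior build of the same vendor schema, or None."""
    row = conn.execute(
        "SELECT p.load_id FROM cdm_load c "
        "JOIN cdm_load p ON p.vendor_schema_id = c.vendor_schema_id "
        "AND p.load_id < c.load_id "
        "WHERE c.load_id = ? ORDER BY p.load_id DESC LIMIT 1",
        (load_id,),
    ).fetchone()
    return row[0] if row else None
```

With something like this, a study protocol could cite a single load_id, and `previous_load` gives the n-1 build to compare against.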

I think we should add a unique key approach to the CDM, specifically in the Metadata schema that we are designing. As part of the design proposal, we have a standard concept for “CDM Load Id” that indicates a record is storing the CDM Load Id. As a new THEMIS convention, everyone should store a record in the metadata schema with that standard concept and the id’s value.

Further, perhaps as part of the WebAPI repository, a source lineage table or tables could be added that can help with linking CDM versions together, so that we could show some useful DQ trending information about multiple versions.
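As a sketch of the kind of n-1 versus n comparison this would enable, the snippet below computes the percent change in record counts per table between two builds. It is not how Achilles or WebAPI work today; `compare_record_counts` and the input dictionaries are hypothetical, and the counts in the usage example are made-up placeholders rather than real data.

```python
def compare_record_counts(previous, current):
    """Compare per-table record counts between two CDM builds.

    Both arguments map table name -> record count. Returns a dict of
    table name -> (previous_count, current_count, percent_change), where
    percent_change is None if the table was empty or absent previously.
    """
    report = {}
    for table in sorted(set(previous) | set(current)):
        before = previous.get(table, 0)
        after = current.get(table, 0)
        pct = None if before == 0 else (after - before) * 100.0 / before
        report[table] = (before, after, pct)
    return report


if __name__ == "__main__":
    # Placeholder counts for illustration only.
    build_n_minus_1 = {"person": 1000, "visit_occurrence": 5000}
    build_n = {"person": 1100, "visit_occurrence": 5000, "note": 10}
    for table, (before, after, pct) in compare_record_counts(
        build_n_minus_1, build_n
    ).items():
        change = "n/a" if pct is None else f"{pct:+.1f}%"
        print(f"{table}: {before} -> {after} ({change})")
```

A large unexpected swing in any table between consecutive builds would be a natural thing to flag for DQ review.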