
Quality certification for ETL - Polling interest


Recently, we had a bunch of industry folks come together, hosted by J&J, to discuss OHDSI and what is needed to help adoption and drive use cases. We ended up with a good list of challenges. I am sure the public sector has exactly the same kind of bellyache, so it makes sense to me to deal with this at the community level.

The most pressing need folks expressed was ensuring the quality of data in the CDM: How can anybody be sure that the conversion hasn’t created havoc, and that the results of an analysis can be trusted? One way of doing that would be to establish both the content and the service of an ETL certification.

Is anybody thinking about this right now?


We have done QC on ETL processes done by others. The main finding from that project was that there was a lot of variation in the ETL wherever the CDM guidelines were anything less than perfectly specified: variation in the vocabulary version used, in the definition of “visit”, in how codes were moved, and in the use of concept_id == 0. So part of it has to come from OHDSI itself. There is already another post on the forum about the 86-page guide that PEDSnet had to put together to ensure consistent ETL: Standardizing weight data

Our approach has been to build a “rabbit in a hat” type of design specification and then automated software that will implement it. In this way, we QC the software, and develop/QC the ETL spec. To accomplish this, we moved to an intermediate data model that we call a “generalized data model”. This focuses on the re-arrangement of data and provenance, and avoids visits and most vocabulary mappings. From this point, we have a second ETL for the specific data model. That is how we are doing it, but it doesn’t have to be that way.

The main point is that

  • OHDSI (perhaps in collaboration with ETL vendors) needs to provide detailed implementation guidance before anyone can certify anything
  • the process needs to be automated to generate evidence of proper testing and QC
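To make the second bullet concrete: a rough sketch of what “automated evidence” could mean is a small check runner that executes named QC queries against the CDM and writes the results to a machine-readable record. Everything below is hypothetical for illustration (a tiny in-memory SQLite database standing in for a CDM, with two example checks); the point is only the pattern of named checks producing an evidence artifact.

```python
import json
import sqlite3

# Hypothetical in-memory stand-in for a CDM instance, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY);
CREATE TABLE condition_occurrence (
    condition_occurrence_id INTEGER PRIMARY KEY,
    person_id INTEGER,
    condition_concept_id INTEGER);
INSERT INTO person VALUES (1), (2);
INSERT INTO condition_occurrence VALUES (10, 1, 201826), (11, 2, 0);
""")

def run_checks(conn):
    """Run each named check and return a machine-readable evidence record."""
    checks = {
        # Condition rows that reference a person missing from the person table.
        "referential_integrity_violations": """
            SELECT COUNT(*) FROM condition_occurrence co
            LEFT JOIN person p ON co.person_id = p.person_id
            WHERE p.person_id IS NULL""",
        # Rows that fell back to the unmapped concept_id = 0.
        "unmapped_concept_rows": """
            SELECT COUNT(*) FROM condition_occurrence
            WHERE condition_concept_id = 0""",
    }
    return {name: conn.execute(sql).fetchone()[0] for name, sql in checks.items()}

evidence = run_checks(conn)
print(json.dumps(evidence))
```

The JSON record is the “evidence”: it can be archived with each ETL run, diffed between runs, and handed to whoever does the certifying.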

Great points @mark_danese. We’ve thought a lot about quality in the VA and found that incremental updates from an EMR require extra care, because the source data can actually change since the last load. We have several steps of data mapping validation, row count checks, referential integrity checks, and comparison of cohorts in the source data vs. the OMOP transformation. We plan to submit our QA process to the symposium, and might suggest others do the same - or maybe part of the symposium discussion / tutorials could be a roundtable on quality.
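For the row count step, a minimal sketch of what a source-vs-CDM reconciliation could look like (all table names and counts below are made up; in practice they would come from queries against the EMR and the CDM, and the tolerance is a project choice):

```python
# Hypothetical row counts from the source EMR and the OMOP transformation.
source_counts = {"patients": 1500, "diagnoses": 42000, "prescriptions": 31000}
cdm_counts = {"patients": 1500, "diagnoses": 41950, "prescriptions": 29000}

def reconcile(source, cdm, tolerance=0.01):
    """Flag any table whose CDM row count drifts more than `tolerance` from source."""
    report = {}
    for table, n_source in source.items():
        n_cdm = cdm.get(table, 0)
        drift = abs(n_source - n_cdm) / n_source if n_source else 0.0
        report[table] = {"source": n_source, "cdm": n_cdm, "ok": drift <= tolerance}
    return report

report = reconcile(source_counts, cdm_counts)
for table, result in report.items():
    print(table, "OK" if result["ok"] else "MISMATCH", result)
```

Some drift is expected (filtered rows, deduplication), which is why a tolerance makes more sense than exact equality - the interesting part is explaining any table that lands outside it.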


Hi Christian,

I would applaud an effort in this direction! We are currently working on a proposal for an ‘OMOP mapping quality scorecard’ that, for every key domain, displays the source and target vocabulary, the number of unique concepts, the occurrence- and term-level mapping coverage, and a few other easy-to-calculate yet informative statistics, and we are trying this out to see if we can arrive at a reliable and useful ‘score’. We could possibly even work out more sophisticated quality indicators, like the granularity / vocabulary level that terms are mapped to, although this won’t be easy to generalize.
The cool thing about this is that you should be able to monitor the score between subsequent runs of the mapping, and also be able to somewhat objectively compare ETL mappings between projects.
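As a rough sketch of two of the easy-to-calculate statistics (the mapping rows below are invented for illustration; target_concept_id 0 stands for an unmapped code, following the CDM convention):

```python
# Hypothetical mapping table: (source_code, target_concept_id, n_occurrences).
# A target_concept_id of 0 means the source code could not be mapped.
rows = [
    ("ICD10:I10", 320128, 900),
    ("ICD10:E11.9", 201826, 450),
    ("LOCAL:XYZ", 0, 150),
]

def coverage_scores(rows):
    """Return unique-code coverage and occurrence-weighted coverage."""
    mapped = [r for r in rows if r[1] != 0]
    code_coverage = len(mapped) / len(rows)
    total_occurrences = sum(r[2] for r in rows)
    occurrence_coverage = sum(r[2] for r in mapped) / total_occurrences
    return code_coverage, occurrence_coverage

code_cov, occ_cov = coverage_scores(rows)
print(f"unique-code coverage: {code_cov:.2f}, occurrence coverage: {occ_cov:.2f}")
```

The two numbers diverge whenever the unmapped codes are rare in the data, which is exactly why showing both on the scorecard is informative.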

I actually had it on my action list to contact you to propose a Working Group for this, so this timing is perfect (and not entirely coincidental :smile:). Obviously this is a hot topic in EMIF now that we’ve done quite a few mappings of European data sources in the consortium. @mvspeybr @MaximMoinat



@keesvanbochove et al.

Your timing is very apropos. We (@Asha_Mahesh, Minie Chou and @gregk) want to call for a new WG called THEMIS, after the Greek goddess of divine order, fairness, law, natural law, and custom. It would standardize all the nitty-gritty details and conventions, and hopefully get the CDM to a standard that makes certification and automatic quality checking much easier. It would be wonderful if you guys could help with your European perspective, and also bring other Europeans in. We don’t want this to be a US-only activity.

Let me put out a separate Forum posting.

I have added some limited unmapped data rules and checks into Achilles.

For example here:

and here

@scottduvall, @keesvanbochove, @Vojtech_Huser:

Please have a look at this; it may be of interest to you.