"Egg on face" - a cautionary tale about ongoing OMOP Vocab change management problems

krfeeney · April 14, 2021, 9:35pm

Friends,

Over the years, we’ve had a number of threads about vocabulary versioning and managing change (e.g. here, here, here), but it’s time to come back to this topic because I’ve now personally been embarrassed by the vocab management process. Or, as @Christian_Reich calls it, an “egg on face” moment.

The Situation: In N3C we built a COVID phenotype that looks across conditions and measurements. This phenotype is soon to be celebrating its first birthday. Last week I was on the N3C Phenotype & Data Acquisition Office Hours, when a lovely collaborator from an OMOP site stopped by and reported this Issue: Non-standard OMOP concepts in phenotype. I was flabbergasted. I would never dare to put non-standard concepts in a concept set – @Patrick_Ryan has taught me better than this.

Well… it turns out the site was right! The Vocabulary had updated and deprecated a number of the OMOP Extensions we originally used to find COVID. This, in itself, makes complete sense. It’s understood that newer concepts may exist as substitutes. What didn’t make sense: even after reviewing the vocabulary release notes, I could not determine in which vocabulary release this deprecation took effect. When I inquired with @mik, he mentioned the Valid Start/End Dates may not correspond with the release in which this deprecation occurred.

The Underlying Use Case: I’ve been running COVID studies for the last year and have a number of phenotypes using these deprecated concepts. I need to be able to tell people through what date they were really effective and in which version of the vocabulary they stopped being usable. The only way to do this today is to manually look at the delta from release to release to determine when the deprecation occurred. The problem is, I don’t maintain the Vocabulary (thanks to those who do)… I should not have to do the dirty work of following the bread crumbs of a release. The concept table or other metadata associated with the Vocabulary files should be able to tell me when the concept came into existence and when it was deprecated.
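The manual release-to-release comparison could be scripted roughly like this. It is only a minimal sketch, assuming you have kept your own snapshots of the CONCEPT table for each release. The release labels, concept IDs, and the `first_deprecated_release` helper are hypothetical; the only real column it relies on is `invalid_reason`, which is empty for valid concepts.

```python
from typing import Dict, List, Optional, Tuple

# One snapshot per release: {concept_id: invalid_reason}.
# In CONCEPT, invalid_reason is empty for valid concepts,
# 'D' for deleted and 'U' for upgraded ones.
Snapshot = Dict[int, str]


def first_deprecated_release(
    concept_id: int, releases: List[Tuple[str, Snapshot]]
) -> Optional[str]:
    """Return the label of the earliest release that flags the concept invalid.

    `releases` must be ordered oldest first. Returns None if the concept is
    never flagged (or never present) in the supplied snapshots.
    """
    for label, snapshot in releases:
        if snapshot.get(concept_id):  # non-empty invalid_reason
            return label
    return None


# Hypothetical data: labels and IDs are made up for illustration.
releases = [
    ("release_A", {1001: "", 1002: ""}),
    ("release_B", {1001: "D", 1002: "", 1003: ""}),
    ("release_C", {1001: "D", 1002: "U", 1003: ""}),
]

print(first_deprecated_release(1001, releases))
print(first_deprecated_release(1003, releases))
```

This only works if someone has archived every release, which is exactly the gap described above.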

We all have stories like this. @Christophe_Lambert mentioned some of his own problems with deprecated concepts in ETL in a community discussion this year.

The Other Problem: This week I wanted to run an analysis using the newly minted J&J vaccine concepts in OMOP data. I know that J&J codes were released by CPT4 in January. What I don’t know from the concept table is when these codes went into existence. The non-standard CPT4 code has been given this record:

I understand from our tribal memory that we use the 1970s to signal that we don’t know when something came into existence. However, we know the time and space when this code was issued. It may be that the CPT4 data makes this date field difficult to populate. I can concede that.

Ok, so I don’t know when this code came into existence from its non-standard record, but if I look up the standard code, it’s a little better:

Still. I actually don’t know when it was released and available for OHDSI Vocab users to build concept sets. Why does this matter? I’m trying to understand the moving pieces of OMOP. If I can stratify when the concept was issued and available in the concept table, I can identify which CDMs I work with that have updated their vocabulary to include this. It allows me to know whether they may have used the concept ID, as issued, or mapped into a 0 - no matching concept.
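To make the stratification concrete, here is a rough sketch of what I would like to be able to do. It assumes two things I do not have today in a reliable form: the date a concept first shipped in a vocabulary release, and the vocabulary release date each site loaded. All site names, dates, and the `stratify_sites` helper are hypothetical.

```python
from datetime import date
from typing import Dict, List


def stratify_sites(
    concept_available: date, site_vocab_dates: Dict[str, date]
) -> Dict[str, List[str]]:
    """Split sites by whether their vocabulary release is on or after
    the date the concept first shipped."""
    groups: Dict[str, List[str]] = {
        "may_use_concept": [],
        "likely_mapped_to_0": [],
    }
    for site, vocab_date in sorted(site_vocab_dates.items()):
        if vocab_date >= concept_available:
            groups["may_use_concept"].append(site)
        else:
            groups["likely_mapped_to_0"].append(site)
    return groups


# Hypothetical inputs: neither the release date nor the site dates are real.
sites = {
    "site_a": date(2021, 1, 10),
    "site_b": date(2021, 3, 30),
    "site_c": date(2020, 11, 2),
}
groups = stratify_sites(date(2021, 3, 1), sites)
print(groups)
```

The first argument is the piece the concept table cannot give me right now.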

I understand these use cases were rare before March 2020. However, it’s time to expose that we need a better system for change management in OMOP Vocabulary releases. I heard similar concerns from All of Us sites during their semi-annual meeting last week. I’m adding my friends in network research (@cukarthik @samart3 @roger.carlson @clairblacketer @DaveraG @Harold_Lehmann @stephanieshong @callahantiff @ericaVoss @Rijnbeek @MPhilofsky @gregk @Adam_Black @Andrew) who no doubt feel this pain.

The time is now to address this use case. It can’t be ignored anymore.

Adam_Black · April 14, 2021, 10:05pm

Wow great description of an important issue @krfeeney. I can share some of my realizations in this area.

The OMOP CDM is the answer to all the different languages that health data is recorded in. Through the magic of the vocab team we can now all speak one common language! But standard concepts change over time, which means the words of that common language are always changing, sometimes with a large effect on analyses. Different datasets, even within a single organization, that were converted at different times use different vocabulary versions. Updating the vocab without rerunning the ETL will “break” your CDM. Incremental ETLs are difficult and relatively rare, which means keeping all these datasets in the current version of the common language requires a lot of repeated computation. There is no (easy) way to download older versions of the vocabulary. The price we pay for interoperability is that the way our data is represented changes over time, which at times feels like standing on shifting ground.

Christophe_Lambert · April 14, 2021, 11:37pm

Thanks @krfeeney for highlighting this issue again. What I was floored by some months back, for which I have no solution, is that many many deprecated concepts in the concept table do not get included in the concept_relationship or concept_ancestor tables. Thus numerous conditions, observations, etc. that were in use at one time (for example we see them in use in MarketScan claims data) will have no mapping to a standard term. There are over 400 such deprecated terms related to psychiatry – if you analyze data that uses those deprecated codes, you will miss whole categories of conditions and procedures that were coded.

Some ETLs, for instance ETL-CDMBuilder, don’t even map deprecated source terms to their source_concept_id, though they keep the source concept_code. Thus even if you built concept sets based on source vocabulary concept_id’s, events corresponding to those source codes will be missed – you would have to do string matching with however particular concept_codes were represented in your particular source data, which includes variants that drop the periods in ICD-9/10 versus not. Also, if you were counting on any ancestor relationships in source vocabularies to work, you would be disappointed.
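The string-matching fallback would look something like the sketch below. It is illustrative only: the `normalize_code` and `match_events` helpers are hypothetical, and stripping periods is only safe when you compare codes within a single source vocabulary, since different vocabularies can collide once punctuation is removed.

```python
from typing import Iterable, List


def normalize_code(code: str) -> str:
    """Canonicalize a source code: trim, uppercase, drop periods."""
    return code.strip().upper().replace(".", "")


def match_events(source_codes: Iterable[str], target_codes: Iterable[str]) -> List[str]:
    """Return the source codes that match any target code after normalization."""
    targets = {normalize_code(c) for c in target_codes}
    return [c for c in source_codes if normalize_code(c) in targets]


# Example codes chosen for illustration; both spellings of the first code match.
matched = match_events(["296.20", "29620", "250.00"], ["296.20"])
print(matched)
```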

I would propose that all deprecated terms have mappings in the concept_relationship table, and whatever those mappings are also fall within the concept_ancestor table.
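A check for the gap I am describing could be sketched as follows, assuming the concept and concept_relationship tables have been loaded into memory. The data and the `unmapped_deprecated` helper are hypothetical, and for brevity the sketch ignores the `invalid_reason` on the relationship rows themselves.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Relationship IDs that point a concept at a successor or standard term.
MAPPING_RELATIONSHIPS = {"Maps to", "Concept replaced by"}


def unmapped_deprecated(
    concepts: Dict[int, str],
    relationships: Iterable[Tuple[int, str, int]],
) -> List[int]:
    """Return deprecated concept IDs that have no outgoing mapping.

    `concepts` maps concept_id to invalid_reason ('' when valid).
    `relationships` holds (concept_id_1, relationship_id, concept_id_2).
    """
    mapped: Set[int] = {
        c1 for c1, rel, _ in relationships if rel in MAPPING_RELATIONSHIPS
    }
    return sorted(
        cid for cid, invalid in concepts.items() if invalid and cid not in mapped
    )


# Hypothetical rows for illustration only.
concepts = {1: "", 2: "D", 3: "U", 4: "D"}
relationships = [
    (1, "Maps to", 1),
    (3, "Concept replaced by", 1),
    (4, "Is a", 1),  # a hierarchy link does not count as a mapping
]
print(unmapped_deprecated(concepts, relationships))
```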

gregk · April 14, 2021, 11:46pm

@krfeeney first of all, you had me look up the “egg on face” expression. Here is what Google says

So, apparently there are some benefits to that

Jokes aside - yeah, this is a puzzle we need to crack… To be honest, there is another saying that is very applicable in this case - “the biggest strength can be the biggest weakness”. And the issue is not quite trivial - the solution is partly process transparency and standardization and partly a technical solution (to automate the process transparency and standardization).

Not coincidentally, we have had a few feedback and brainstorming sessions about the whole vocabulary improvement situation, including with @Patrick_Ryan, @Christian_Reich, @Adam_Black and multiple others.

Here is what is emerging:

Issues:

Need to improve transparency and consistency of the vocabulary management process, including the current state and update plan for each vocabulary
Need to have better tooling and automation to determine the impact on existing design artifacts and OMOP data
Need to have a dedicated place to discuss improvement ideas, prioritize vocabulary work etc…
With never-ending updates to each individual vocabulary, it would be good to have bundled releases that are a) version tagged and b) stored in a dedicated repo
Need to enable external contributions and a more collaborative process
Need to enable collaborative quality control

Possible solutions:

Vocabulary Dashboard that would create a broad perspective on OMOP Standardized Vocabs ontology as well as detailed metadata information on each individual vocabulary, including version, maturity state, last updated, contact person, source of data, expected update date (roadmap) and more
Dedicated Vocabulary WG
Tooling to enable automation to determine vocabulary update impact on existing design artifacts and CDM data
Vocabulary build repository that would store OMOP Standardized vocab builds - at least quarterly - to ensure history is available

and many other ideas.

I guess the most critical one is to have a place to discuss feedback, solutions, prioritize things etc… - the OMOP Vocabulary WG. We will be announcing one very, very soon… In the meantime I hope that your post will gather a lot of good feedback and ideas that we could use to kick start the WG and improvement process

roger.carlson · April 15, 2021, 1:02pm

Re: Egg On Face
It appears the Vocabulary WG should address the various meanings of “egg on face”. I suggest the following:

Egg on face
Domain: Condition
Vocabulary: Cambridge Dictionary
Concept Class: Metaphor

Egg on face
Domain: Procedure
Vocabulary: TheDermReview
Concept Class: Health Benefits

roger.carlson · April 15, 2021, 1:17pm

In terms of vocabulary maintenance, this appears to be a classic problem in Dimensional Modeling: Slowly Changing Dimensions.

I’m not an expert in dimensional modeling (having spent my entire career in relational models), and for those of you like me, here is a good explanation of the topic that I have found useful: Introduction to Slowly Changing Dimensions (SCD) Types

It strikes me that the vocabularies are structured as a Type 2 SCD but are mostly used as a Type 1 SCD. Each “higher” type increases accuracy at the expense of increased complexity, both in maintenance and in use. What the solution is, I don’t know, but Type 6 looks intriguing.
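To illustrate the difference, here is a minimal sketch of a Type 2 style "as of" lookup using the validity dates that the concept table already carries. The `ConceptRow` class, the `valid_on` helper, and the example dates are hypothetical, and as noted earlier in this thread the real dates may not line up with the release in which a change was published.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ConceptRow:
    concept_id: int
    valid_start_date: date
    valid_end_date: date  # 2099-12-31 conventionally marks "still valid"


def valid_on(row: ConceptRow, as_of: date) -> bool:
    """Type 2 style question: was this concept valid on a given date?"""
    return row.valid_start_date <= as_of <= row.valid_end_date


# Hypothetical concept that was retired at the end of 2020.
row = ConceptRow(1001, date(2020, 3, 1), date(2020, 12, 31))

# Type 1 usage only ever asks about "today"; Type 2 lets a study ask about its own date.
print(valid_on(row, date(2020, 6, 1)))
print(valid_on(row, date(2021, 4, 1)))
```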

Jake · April 16, 2021, 7:54pm

From my perspective, part of the problem is that the vocabulary generation is: (quoting the “about” section on https://github.com/OHDSI/Vocabulary-v5.0): “Currently not available as independent release. Therefore, do not clone or try to replicate. It is work in progress and not ready for replication.”

From a CI perspective, an automated, replicable vocabulary build would be the first step in building tools to query and illuminate the differences between vocabulary releases, tying them back to the originating commits, and potentially using the results to cascade effects or deprecation warnings to other tools affected by those changes.

I opened a github issue inquiring about the work required to put continuous integration techniques and tools in place for vocab generation a month ago after finding the OHDSI vocab falling short when compared to some other non-public tools, and hoping to maybe contribute to a solution. There seemed to be a lack of interest in engagement: https://github.com/OHDSI/Vocabulary-v5.0/issues/466

mik · April 19, 2021, 7:14pm

Hi @Jake, not exactly lack of interest in engagement, more the use case… As @Christian_Reich has pointed out, it does not really make a lot of sense to replicate the vocabulary generation process in your own environment. For example, Concept IDs are unique and are generated within a number range on the actual vocabulary server supplying Athena with content. The logic per respective vocabulary is put on github for transparency reasons (and so that people can suggest improvements, if they wish).
The wording is maybe a little misleading. The code in Vocabulary-v5.0 is not meant to be released but reflects the current logic being applied to all the vocabularies listed.
The versioning issue (and traceability) is indeed something we need to cover. The model should be able to provide the Type 2 SCD approach mentioned by @roger.carlson by keeping a link (the relationship mentioned above by @Christophe_Lambert) from a deprecated item to the active successor replacing it, including proper start and end dates. Let me consult with @clairblacketer and @krfeeney again on where this works already and where we desperately need improvement.
So, coming back to your CI comment, I welcome your input on improving the vocabulary build to produce robust results that allow traceability of changes. There are tools like TANTALUS that create a delta view between vocabulary versions. We have however not yet adopted these into the vocabulary release process and are currently not keeping historic vocabulary versions.
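As a rough illustration of what a delta view could feed into, here is a minimal sketch. It is not how TANTALUS works; the `diff_snapshots` helper, the `Delta` type, and the data are hypothetical, and each snapshot is simply a mapping from concept_id to invalid_reason ('' when valid).

```python
from typing import Dict, NamedTuple, Set


class Delta(NamedTuple):
    added: Set[int]          # present only in the newer snapshot
    removed: Set[int]        # present only in the older snapshot
    newly_invalid: Set[int]  # valid before, flagged invalid now


def diff_snapshots(old: Dict[int, str], new: Dict[int, str]) -> Delta:
    """Compare two concept snapshots keyed by concept_id."""
    added = set(new.keys() - old.keys())
    removed = set(old.keys() - new.keys())
    newly_invalid = {
        cid for cid in old.keys() & new.keys() if not old[cid] and new[cid]
    }
    return Delta(added, removed, newly_invalid)


# Hypothetical snapshots and concept set for illustration.
old = {1001: "", 1002: "", 1003: "D"}
new = {1001: "D", 1002: "", 1003: "D", 1004: ""}
delta = diff_snapshots(old, new)

concept_set = {1001, 1002}
affected = concept_set & delta.newly_invalid
print(delta, affected)
```

Intersecting the delta with a saved concept set is one way a release process could warn phenotype owners before a site reports the problem.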

Jake · April 19, 2021, 7:57pm

Thanks for the reply @mik. I think there is a misunderstanding, I wasn’t trying to suggest that it would be common for OHDSI members to want to build the vocabularies themselves.

I would, however, expect that someone (perhaps like myself), sufficiently interested in helping the vocabulary team maintain vocabularies and improve CI / traceability of the vocab build process, should be able to replicate the stuff that happens over at Odysseus and ends up on Athena and in CDMs. I appreciate the current transparency and realize that it may be asking a lot to connect the dots for this “last mile”. I understand there are some large barriers, probably even some I’m not aware of! But I’m also offering whatever help I can give, and the issue I opened was a request for information on what may be needed (and how we might proceed).

I think the future success of OHDSI relies on the community being able to replicate this process when needed, opening up the possibility for contributions and easing of the burden on those who know the magical incantation needed to animate PALLAS :).

Jake · April 23, 2021, 12:58pm

@mik, @Christian_Reich, others, what would be the right way to engage y’all about helping to get the vocabulary build process into a shape that it could be run / recreated by CI tools (and interested community members), opening up the possibility for contributions towards fixing some of the problems posed in this thread?

Vojtech_Huser · April 23, 2021, 2:22pm

as FYI - a related issue with end and start date for concepts in ICD10CM is described in this forum post: ICD10CM start and end dates for retired and new concepts

Regarding concept sets - just as the phenotype group manages phenotypes, I think it would be nice to have a structure for managing concept sets: a re-usable, global way to reference a concept set. In Atlas, once I import a given concept set, it is not a dynamic link. With a dynamic link, updating the linked concept set would update multiple phenotypes at once.

Christian_Reich · April 23, 2021, 2:34pm

Thanks, @Jake. Right now, we do that by letting folks into our box, rather than having multiple boxes in parallel. Reason is that the amount of work reproducing it would be large. But doesn’t mean it couldn’t be done.

What do you have in mind?

Jake · April 26, 2021, 12:56pm

@Christian_Reich, Great! I’d be interested in meeting with whoever runs the build and release process or has a dev ops focus over at Odysseus, learn more about the current process, and brainstorm ways we could virtualize / standardize this environment and take advantage of some of the github CI tools. The end I currently have in mind is automated build and diff generation on vocabulary release with persisted artifacts.

Jake · May 5, 2021, 7:46pm

@Christian_Reich, @mik, thoughts? Is there a better way to continue this conversation with y’all?

krfeeney · May 5, 2021, 7:51pm

@Jake, I haven’t forgotten about you. I started a new role this week so I am a little slow. @clairblacketer and I are working with @mik to help the Vocab WG get more hands and feet so he can hold public meetings. Ping me your OHDSI Teams name and I’ll make sure we add you to the channel!

Jake · May 5, 2021, 7:54pm

Thanks! Sent you a ping on Teams. Sorry if I came across as impatient, just don’t want this to fall off of my radar.

MaximMoinat · May 19, 2021, 12:32pm

For future reference: this topic has been debated during the OHDSI Community call on May 11th 2021. See OHDSI Community Calls – OHDSI and direct link to the debate: https://youtu.be/GvQ2VX0kAqI

(after the debate, a majority of the audience was in favour of aligning the OHDSI network on a common vocabulary version)

silviodc · August 17, 2021, 11:17pm

I have been following the discussion; it looks very interesting.

“The only way to do this today is to manually look at the delta from release to release to determine when the deprecation occurred.”

Perhaps there are tools that could reduce this work. Like: https://www.dynaccurate.com/