Share Your Vocab/Concept Updating Process

mjg · February 16, 2021, 6:28pm

Hi OHDSI Community!
I’m very curious if there are organizations that could share their strategy/process around updating their vocabulary/concept tables. I’m going to be attempting to create this kind of process, but I think it could be helpful to see if others can share their experience.
As I see it, in its simplest form: I’ve created a set of vocabularies I’d like to include in my Athena download and checked the box to be notified of changes to any of them. On every update notification, I would download the files again, take the CSV files, and do a kill and fill into the 9 vocabulary and concept tables.
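A minimal sketch of the kill-and-fill step described above, not a tested production loader. It assumes the downloaded files are tab-delimited with a header row whose names match the table columns, and it uses an in-memory SQLite database only to stay self-contained; the function name `kill_and_fill` and the three-column `concept` table are illustrative.

```python
import csv
import io
import sqlite3


def kill_and_fill(conn, table, columns, tsv_file):
    """Replace every row of `table` with the rows read from a tab-delimited file.

    `table` and `columns` are assumed to come from trusted configuration,
    since they are interpolated into the SQL text.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = [tuple(record[col] for col in columns) for record in reader]
    placeholders = ", ".join("?" for _ in columns)
    column_list = ", ".join(columns)
    with conn:  # one transaction: the delete and the reload land together or not at all
        conn.execute(f"DELETE FROM {table}")
        conn.executemany(
            f"INSERT INTO {table} ({column_list}) VALUES ({placeholders})", rows
        )
    return len(rows)


# Illustrative usage with made-up rows (concept IDs here are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE concept (concept_id INTEGER, concept_name TEXT, domain_id TEXT)"
)
conn.execute("INSERT INTO concept VALUES (1, 'stale row', 'Observation')")
download = io.StringIO(
    "concept_id\tconcept_name\tdomain_id\n"
    "1\trefreshed row\tCondition\n"
    "2\tnew row\tDrug\n"
)
loaded = kill_and_fill(
    conn, "concept", ["concept_id", "concept_name", "domain_id"], download
)
```

Wrapping the delete and the inserts in a single transaction means a failed load leaves the previous vocabulary in place rather than an empty table.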

Has anyone done historical tracking of the vocabularies? Versioning each download? Audit history?
Do you download all vocabularies every single time or have you created separate downloads?
Do you keep track of any addition/changes between vocabulary updates?
Are you updating right away? Do you download on a cadence?

Please share any of your thoughts on this as well.

Thanks!!!

DTorok · February 16, 2021, 9:01pm

We only update the vocabulary about once a quarter. We download all the vocabularies of interest into a new schema and then run a full ETL using the new vocabulary tables. The domain_id of a concept can change between releases, meaning that, in the CDM, an event row that was in, say, Observation should now be in Condition Occurrence.
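To illustrate the domain change mentioned above, here is a small sketch that compares the domain_id of each concept across two releases. It assumes each release has already been read into a plain dictionary of concept_id to domain_id; the function name and the concept IDs in the example are hypothetical.

```python
def domain_changes(old_domains, new_domains):
    """Return {concept_id: (old_domain, new_domain)} for concepts present in
    both releases whose domain_id differs."""
    changes = {}
    for concept_id, old_domain in old_domains.items():
        new_domain = new_domains.get(concept_id)
        if new_domain is not None and new_domain != old_domain:
            changes[concept_id] = (old_domain, new_domain)
    return changes


# Hypothetical concept IDs, for illustration only.
old_release = {111: "Observation", 222: "Drug", 333: "Measurement"}
new_release = {111: "Condition", 222: "Drug", 444: "Procedure"}
changed = domain_changes(old_release, new_release)
```

Any concept in the result would land in a different CDM table after a full ETL against the new release, which is one reason a full re-run is simpler than patching rows in place.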

jposada · February 16, 2021, 9:21pm

hi @DTorok,

Thank you for the answer. How are you dealing with mappings that may map to a concept that is now deprecated?

DTorok · February 16, 2021, 9:46pm

Because we run a full ETL after updating the vocabulary we do not do anything special. If the source concept is deprecated it is usually because it has been retired. We do not test the invalid_reason when looking up the source concept because the source code was valid at the time it appears in the source data. The ‘Maps to’ relationship is maintained even if the source concept is no longer valid. And if it is the target concept (concept_id_2) that is deprecated we trust that the ‘Maps to’ relationship has either been deleted if there is no valid target, or updated to point to the correct target concept.
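A sketch of the lookup described above, under stated assumptions: the query deliberately does not test invalid_reason on the source concept, and it follows only ‘Maps to’ rows whose own invalid_reason is NULL. The second filter is an assumption about how current relationships are flagged after loading, not something stated in the post. Table layouts are trimmed to the columns used, and the codes and IDs are made up.

```python
import sqlite3

MAPS_TO_SQL = """
SELECT cr.concept_id_2
FROM concept AS src
JOIN concept_relationship AS cr
  ON cr.concept_id_1 = src.concept_id
 AND cr.relationship_id = 'Maps to'
 AND cr.invalid_reason IS NULL
WHERE src.vocabulary_id = ?
  AND src.concept_code = ?
ORDER BY cr.concept_id_2
"""


def map_source_code(conn, vocabulary_id, concept_code):
    """Return target concept IDs for a source code.

    There is intentionally no filter on src.invalid_reason: a retired source
    code was still valid when it was recorded in the source data.
    """
    rows = conn.execute(MAPS_TO_SQL, (vocabulary_id, concept_code)).fetchall()
    return [row[0] for row in rows]


# Illustrative data: a deprecated source concept that still has a live mapping.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE concept (concept_id INTEGER, vocabulary_id TEXT, "
    "concept_code TEXT, invalid_reason TEXT)"
)
conn.execute(
    "CREATE TABLE concept_relationship (concept_id_1 INTEGER, "
    "concept_id_2 INTEGER, relationship_id TEXT, invalid_reason TEXT)"
)
conn.execute("INSERT INTO concept VALUES (1001, 'ICD9CM', 'X1', 'D')")
conn.executemany(
    "INSERT INTO concept_relationship VALUES (?, ?, ?, ?)",
    [(1001, 2002, "Maps to", None), (1001, 2003, "Maps to", "D")],
)
targets = map_source_code(conn, "ICD9CM", "X1")
```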

MPhilofsky · February 16, 2021, 11:09pm

Colorado downloads all chosen Vocabularies once a week. Once the CDM has been updated, we archive the old Vocabularies. We do not have an audit history, keep track of additions/changes, or do any historical tracking.

We also believe in the process and do as Don does.

If there is a hot fix or urgent need, we could update sooner, but once a week seems sufficient for our most pressing use case, which is COVID-19.

mjg · February 17, 2021, 1:59pm

Thank you @MPhilofsky and @DTorok for sharing! Keep them coming if anyone else has anything to share or add.

@DTorok, is your dataset growing? If it is, your ETLs probably become longer each quarter.

@DTorok, @MPhilofsky, given that you trust the vocab update process, how often would you say that a concept of interest has changed in terms of its standardization, domain, relationships, etc.?

Dermot_Doyle · February 17, 2021, 2:19pm

Hi @mjg

I’m copying @Alexdavv on this as we’ve been suggesting running a trial of a new AI which is designed to remap clinical terminologies and controlled vocabularies at scale, as well as keep a complete history of the changes (i.e. the evolution).

Currently we are thinking of testing it against the ICD/SNOMED mappings in OMOP. You can see that conversation in the thread “Some ICD10CM codes are not mapped correctly”.

The AI is called www.dynaccurate.com - and I think it might interest you as well.

katy-sadowski · February 17, 2021, 2:23pm

Hi there! I’m quite interested in this topic as well. My organization has not figured out an elegant solution to this yet.

For now, any time we do a full re-run of our ETL (which we need to do in order to update the vocab), we move the previous dataset into an archive schema. We can thus use the archive to perform ad hoc comparisons to see how concept sets/cohorts were affected by the update.

One idea I’ve had would be to somehow leverage the cohort comparator in CohortDiagnostics to do the above comparisons in a UI, though that would take some figuring out.

MPhilofsky · February 17, 2021, 7:06pm

We don’t keep track of these changes. This would be a good question to ask on the Researchers forum where a domain or relationship change will affect their query, cohort and/or concept set.

jposada · February 17, 2021, 9:08pm

We have seen heavy changes when updating the vocab, that is, records dropping because ‘Maps to’ relationships are missing or because a concept now has fewer concepts to map to. We update the vocabulary separately from the ETL. We do a full inventory of which concepts are in every table and check whether they are in the new vocab. There were a lot of surprises there. We may want to brainstorm a formal process for the entire OHDSI community that syncs with changes brought by the vocabulary team.
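The inventory check described above could look roughly like the sketch below. It is not the poster’s actual process: it assumes the concept IDs used by each CDM table have been pulled into lists, and that the new release is available as a dictionary of concept_id to its standard_concept flag. Concept ID 0 is skipped because it conventionally marks an unmapped record. Table names and IDs in the example are illustrative.

```python
from collections import Counter


def inventory_check(used_concept_ids, new_vocab):
    """Flag concept IDs in use that are missing or non-standard in the new release.

    used_concept_ids: {table_name: iterable of concept_ids, one per row}
    new_vocab:        {concept_id: standard_concept flag, e.g. 'S', 'C' or None}
    Returns {table_name: {concept_id: (problem, affected_row_count)}}.
    """
    report = {}
    for table, concept_ids in used_concept_ids.items():
        issues = {}
        for concept_id, row_count in Counter(concept_ids).items():
            if concept_id == 0:
                continue  # 0 means "no matching concept", nothing to verify
            if concept_id not in new_vocab:
                issues[concept_id] = ("missing", row_count)
            elif new_vocab[concept_id] != "S":
                issues[concept_id] = ("non-standard", row_count)
        report[table] = issues
    return report


# Hypothetical IDs, for illustration only.
in_use = {
    "condition_occurrence": [10, 10, 20, 0],
    "drug_exposure": [30, 40],
}
new_release = {10: "S", 20: None, 30: "S"}
report = inventory_check(in_use, new_release)
```

Keeping the affected row count next to each flagged concept makes it easier to see which surprises matter most before the new vocabulary goes live.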

mark_velez · February 17, 2021, 11:57pm

Hi @mjg. The All of Us Research Program relies on regular updates to the vocabulary for two disparate data feeds: survey data and EHR data. Ultimately these two data feeds wind up in the same data product so they must be associated with a single vocabulary release (i.e. version).

The project develops several survey instruments which are modeled using the OMOP vocabulary in a custom terminology called the PPI. An updated release is issued ~4 times a year, although recently there may have been more frequent updates due to new COVID surveys.

AOU depends on several terminologies in order to consume EHR data submitted by dozens of healthcare organizations around the country. When we first started we endeavored to refresh the vocabulary maybe twice a year to keep things up to date, but we’ve had to do things more frequently at times (e.g. whenever desirable COVID-related concepts got added).

One downside we’ve identified is we need to do a full refresh of the vocabulary every time. My understanding is this is because there may be intricate relationships that could be difficult to tease out if we wanted to, say, just update the PPI and nothing else. Much to their credit, Odysseus did recently give us some SQL scripts to try such an “in-place” update, but we have not had a chance to try this out yet.

Has anyone done historical tracking of the vocabularies? Versioning each download? Audit history?
We keep all historical versions of the vocabularies that have been in use. We also have internal concepts that never make it to Athena; we automatically assign a version (i.e. in vocabulary.vocabulary_version) by creating a hash out of the input data.
Do you download all vocabularies every single time or have you created separate downloads?
We download all vocabularies every single time. We used to rely on a single “bundle” in Athena. It automatically gets archived on the website, so we would “restore” it. But we recently discovered that some of the OMOP Extension concepts had disappeared from the download and would only re-appear after we created an identical bundle in Athena (i.e. with the same vocabularies checked off).
Do you keep track of any addition/changes between vocabulary updates?
As you can imagine, for the PPI we are actively tracking changes because we are authoring the terminology itself. There are changes we wish we kept better track of. For example, there is an implicit hierarchy dictated by concept_relationship records associated with PPI; we are not sure how well this is being maintained. The same goes for relationships between questions and answers. There are also instances where codes the PPI team submitted for addition were not compliant with OMOP vocabulary / Odysseus practices and so they had to be modified.
Are you updating right away? Do you download on a cadence?
We do not update right away. There have been occasions where the data validation routines we apply to EHR data submitted to us would be affected by a vocabulary update. It’s also kinder of us to ensure sites who are submitting data have access to the vocabulary release we will be using for evaluating their submissions. No magic bullet here (that I know of), just careful communication.
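The hash-based version for internal concepts mentioned above could be derived along these lines. This is a guess at the general idea, not the program’s actual implementation: the function name, the choice of SHA-256, and the 12-character truncation are all assumptions made for illustration.

```python
import hashlib


def content_version(files):
    """Derive a short, repeatable version string from input file contents.

    files: {file_name: raw bytes}. Names are processed in sorted order so the
    result does not depend on the order in which the files were supplied.
    """
    digest = hashlib.sha256()
    for name in sorted(files):
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")  # separator so name and content cannot run together
        digest.update(files[name])
        digest.update(b"\0")
    return digest.hexdigest()[:12]


# Made-up file contents, for illustration only.
version = content_version(
    {"concept.csv": b"1\tfoo\n", "vocabulary.csv": b"PPI\n"}
)
```

Because the version depends only on the input bytes, reloading identical inputs yields the same value for vocabulary.vocabulary_version, and any change to the inputs yields a new one.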

Ok I’m going to stop myself here. Happy to do a web call or something if that might be useful.

mark_velez · February 18, 2021, 12:09am

For some time, and against better advice, we were dependent on concept_ancestor records in order to get a hierarchy of LOINC concepts. These records just sorta disappeared from the vocabulary download and required us to add an additional step of reintroducing them to our vocabulary update process. Then we recently learned that such a hierarchy could be obtained by using component of / has component relationships (I believe these are associated with what LOINC calls “properties.”) We just sort of “discovered” this and would like a better way to become aware of things like this. I’m ~~paraphrasing~~ butchering something @ChaoPang told me earlier, so copying him so he can correct me as needed.

mjg · February 18, 2021, 2:41pm

Thanks for sharing your experience! I think I share some of the frustrations you’ve mentioned.

I’m surprised about this! I’ll have to keep this in mind.

mjg · February 22, 2021, 2:46pm

AFAIK this https://github.com/OHDSI/Vocabulary-v5.0/releases/ is a place to see updates of vocabularies. Would you be able to elaborate on what else you do?

It would seem that the link I provided shows a high level set of changes that any user can take a look at, but I could also see a more in depth look at the changes as you mentioned. Would your checks include things like what concepts have changed “maps to” relations, domain changes, specific new concepts added, etc.?
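As one example of the more in-depth checks asked about above, the sketch below lists source concepts whose set of ‘Maps to’ targets differs between two releases. It assumes the ‘Maps to’ rows of each release have been extracted as (concept_id_1, concept_id_2) pairs; the function name and the IDs are hypothetical.

```python
from collections import defaultdict


def maps_to_changes(old_pairs, new_pairs):
    """Return {source_id: (old_targets, new_targets)} wherever the set of
    'Maps to' targets differs between two releases."""

    def targets_by_source(pairs):
        grouped = defaultdict(set)
        for source_id, target_id in pairs:
            grouped[source_id].add(target_id)
        return grouped

    old = targets_by_source(old_pairs)
    new = targets_by_source(new_pairs)
    changed = {}
    for source_id in old.keys() | new.keys():
        before = old.get(source_id, set())
        after = new.get(source_id, set())
        if before != after:
            changed[source_id] = (before, after)
    return changed


# Hypothetical IDs: 1 is remapped, 2 is unchanged, 3 loses its mapping,
# and 4 is mapped for the first time.
old_release = [(1, 100), (2, 200), (3, 300)]
new_release = [(1, 101), (2, 200), (4, 400)]
changed = maps_to_changes(old_release, new_release)
```

An empty set on the right-hand side marks a source concept that lost its mapping, which is the case most likely to cause records to drop in the ETL.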

Christian_Reich · February 23, 2021, 9:29pm

Can you list them, @jposada?

We are planning on exactly that. Stay tuned. We need to either usurp an OHDSI Tuesday or call for one at another slot.

jposada · March 2, 2021, 3:42pm

hi @Christian_Reich,

thank you for your answer. Looking forward to that OHDSI call.

In the meanwhile I am still collecting all the issues to post them here.

MPhilofsky · March 2, 2021, 5:16pm

The Tantalus R package was discussed during the ETL breakout session of the community call today. Per the GitHub, “This is an R package to help you expose differences between two vocabulary versions.” This package should answer some of the questions posted above.