Vocabulathon 2025: Precise mapping

Dymshyts · August 26, 2025, 1:24pm

This is a thread for a subgroup of the Vocabulathon.

Let’s prepare before the Vocabulathon by doing some work offline, so that we get a productive 4 hours in the meeting.
If you’re interested in solving this problem, please answer the following:

Please outline the use cases and problems with non-precise mappings.
How it affects your phenotypes and research in general.
Which vocabularies are you working with?
If you have ideas on how to solve this problem, please share them.
How can you participate in the discussion (only online, or also in-person during the Symposium)?

By the end of the meeting on October 7th, we will know:

what kind of pain this problem causes for different organizations
possible solutions to this problem; hopefully we can choose one to bring to the OHDSI steering committee.

Tagging @katy-sadowski @Andy_Kanter who already showed their interest,

@Gowtham_Rao @Chris_Knoll @Christian_Reich @MPhilofsky @abedtash_hamed who actively participated in the related discussion a while ago.

@aostropolets @Vlad_Korsik @zhuk @m-khitrun @Polina_Talapova @Eduard_Korchmar as the vocabulary team

Dymshyts · August 28, 2025, 8:46am

I’ll start with answering my questions:
Use cases

A couple of examples:
Spotting in first trimester pregnancy.
When a source code is mapped to 2 concepts, one concept is chosen as the index event and the other as an inclusion criterion happening on the same day. That’s not that bad when we look at one concept.
But if we want to build a complicated phenotype definition, such as ‘Any malignancy excluding non-melanoma skin cancer’, it becomes more complicated.

And we can’t do that simply by excluding all descendants of ‘disease in remission’, because not every cancer in remission is mapped to a code which has ‘disease in remission’ as a parent, so I listed the source codes instead.
We can’t easily exclude non-melanoma skin cancer, because it’s mapped uphill to malignant neoplasm.

Or we can’t exclude Migraine with cerebral infarction from the cerebral infarction phenotype, because it’s mapped to both Cerebral infarction and Migraine with aura. (Migraine with infarction is a confusing condition, and clinicians suggested removing it.)
In theory we can say ‘no migraine on the index date’, but that becomes too complicated for users, as they can’t track the resulting set of included source concepts.
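To make the tracking problem concrete, here is a minimal Python sketch on toy data. The Maps_to table and all names in it are illustrative placeholders, not real vocabulary content, and the helper functions are hypothetical. It only shows how one could list which source codes survive a ‘no migraine on the index date’ rule.

```python
# Toy Maps_to table: source code label -> set of standard concept names.
# All entries are illustrative placeholders, not real vocabulary content.
MAPS_TO = {
    "cerebral infarction (source)": {"Cerebral infarction"},
    "migraine with cerebral infarction (source)": {
        "Cerebral infarction",
        "Migraine with aura",
    },
    "migraine with aura (source)": {"Migraine with aura"},
}


def sources_for(concept_set):
    """Source codes with at least one Maps_to target in the concept set."""
    return {src for src, targets in MAPS_TO.items() if targets & concept_set}


def sources_after_same_day_exclusion(include, exclude_same_day):
    """Source codes that survive 'no <exclude> record on the index date'.

    A source code mapped to both sets produces both records on the same
    day, so the exclusion rule removes it.
    """
    return {
        src
        for src in sources_for(include)
        if not (MAPS_TO[src] & exclude_same_day)
    }


included = sources_for({"Cerebral infarction"})
kept = sources_after_same_day_exclusion(
    {"Cerebral infarction"}, {"Migraine with aura"}
)
print(sorted(included))
print(sorted(kept))
```

The point is that the user only gets this list by going back to the mapping table; the cohort definition itself does not show it.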

How to solve it
The idea I like the most: make ICD10CM concepts standard if they don’t have a SNOMED equivalent and if they represent a distinct clinical case, which means ‘other’ and ‘unspecified’ terms will not be standard.
Then these concepts will get an Is_a relationship to the concepts they have Maps_to now.
All source concepts that are mapped to several concepts and have a distinct meaning will become standard, as will concepts that are mapped uphill. We can detect concepts with uphill mappings using an LLM.
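A rough Python sketch of that rule, under stated assumptions: the `has_snomed_equivalent` flag is assumed to come from a manual review or an LLM pass, the ‘other/unspecified’ check is a crude name-based heuristic, and the codes `HYPO-1` and `HYPO-2` are hypothetical. This is not how the vocabulary team would necessarily implement it.

```python
import re
from dataclasses import dataclass


@dataclass
class SourceConcept:
    code: str                    # hypothetical source code
    name: str
    maps_to: list                # names of the current Maps_to targets
    has_snomed_equivalent: bool  # assumed known from review or an LLM pass


# Crude heuristic for residual terms ("other", "unspecified").
RESIDUAL = re.compile(r"\b(other|unspecified)\b", re.IGNORECASE)


def propose(concept):
    """Promote distinct concepts to standard; turn their Maps_to into Is_a."""
    promote = (
        not concept.has_snomed_equivalent
        and not RESIDUAL.search(concept.name)
    )
    relationship = "Is_a" if promote else "Maps_to"
    return {
        "code": concept.code,
        "standard": promote,
        "relationships": [(relationship, t) for t in concept.maps_to],
    }


examples = [
    SourceConcept(
        "HYPO-1",
        "Migraine with cerebral infarction",
        ["Cerebral infarction", "Migraine with aura"],
        False,
    ),
    SourceConcept("HYPO-2", "Other specified condition X", ["Condition X"], False),
]
proposals = [propose(c) for c in examples]
for p in proposals:
    print(p)
```

The first example becomes standard with two Is_a parents; the residual ‘other’ term keeps its Maps_to and stays non-standard.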

Then the other ICD ontologies can be mapped to ICD10CM or SNOMED.
Note: I mostly work with US data, so I might be biased, and I’m open to other candidates for the standard terminologies.

Christian_Reich · August 28, 2025, 11:14am

@Dymshyts:

(You are hyperlinking to epi.jnj.com/atlas. We cannot see that.)

Not understanding your use cases.

Spotting in first trimester pregnancy: Cohorts need to be built with two separate criteria at any rate. Because the data could contain spotting and pregnancy separately. Not sure we need this combo concept at all.
Malignancy except non-melanoma skin cancer: This is a combination concept that isn’t even properly defined, as “non-melanoma skin cancer” is not a thing. It is the same problem as “NOS”. Why can’t we build a cohort with “malignant neoplasm” and descendants plus excluding “basal cell carcinoma of skin” and descendants? Like in 1, we have to do that anyway.

Remission is an Episode. It should not be used as an attribute of a disease concept, because, as you said, there is no way we will ever have all cancers pre-coordinated with “in remission” or “in progression”. These are so-called “Disease Dynamic” Episodes.

Migraine with infarction: Again, the separate concepts need to be included and excluded anyway, because the data might contain them separately.

Bottom line: You are providing several categories of problems with mapping of complex concepts:

AND-combos (spotting and pregnancy): just split them up and create separate inclusion criteria.
AND NOT-combos (non-melanoma skin cancer): do they actually exist as concepts? If they do, no mapping will fix that.
Combination of attributes that live in different domains: These have a problem if there is no way to link them (which in cancer we put in place). But if they don’t have a link mechanism, the only solution I see is OMOP (or SNOMED) Extension.
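For the AND-combo case, a minimal Python sketch of “two separate criteria on the same day” could look like the following. The event tuples and the function name are made up for illustration; a real cohort definition would run against the CDM tables.

```python
from datetime import date

# Toy events: (person_id, event_date, standard concept name).
# Person 1 could come from a combo source code or from two separate records;
# after mapping, both cases look the same.
events = [
    (1, date(2025, 1, 10), "Pregnancy"),
    (1, date(2025, 1, 10), "Spotting"),
    (2, date(2025, 2, 1), "Pregnancy"),
    (2, date(2025, 2, 3), "Spotting"),   # different day, does not qualify
    (3, date(2025, 3, 5), "Spotting"),   # no pregnancy record at all
]


def same_day_cohort(events, index_concept, inclusion_concept):
    """(person, date) pairs where both concepts are recorded on the same day."""
    index_days = {(p, d) for p, d, c in events if c == index_concept}
    inclusion_days = {(p, d) for p, d, c in events if c == inclusion_concept}
    return sorted(index_days & inclusion_days)


cohort = same_day_cohort(events, "Pregnancy", "Spotting")
print(cohort)
```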

Dymshyts · August 28, 2025, 11:42am

thanks @Christian_Reich
I fixed the link.
Cancer excluding non-melanoma skin cancer wasn’t a concept but a phenotype we had.
I probably need to find better examples where one clinical idea is mapped to several concepts, or to one concept with a loss of significant information.

Vojtech_Huser · August 28, 2025, 9:39pm

A related note (not a full reply to your thread):

SNOMED CT at some point concluded that, in addition to terminology, there is a need for a grammar for ‘expressions’.

See Compositional Grammar - Specification and Guide (SNOMED Confluence)

Examples:
https://confluence.ihtsdotools.org/display/DOCSCG/6.5+Expression+With+Nested+Refinements

Then you need a reasoner to possibly conclude that your expression is the same as a formal concept (or a descendant of it).

katy-sadowski · August 31, 2025, 2:00pm

Thanks for starting this thread, @Dymshyts ! Count me in.

My use case (and that of my team at Boehringer): I want to create a concept set using standard concepts for a given condition. However, there are source concepts mapped directly to the “root” standard concept for that condition which I do not want to include in my concept set. This often occurs when specific ICD10-CM codes are mapped to a less-specific standard concept. In this case we are forced to create a source concept set for that condition. I can compile a list of specific examples ahead of the Symposium, if it’d be useful.
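One way to start compiling such a list is to group source codes by their Maps_to target and review the targets that receive several of them. The Python sketch below does that on placeholder data; the mapping pairs, the function names, and the `min_sources` threshold are all assumptions for illustration.

```python
from collections import defaultdict

# Toy (source, standard target) pairs; names are placeholders.
maps_to = [
    ("source A: condition X, specific subtype 1", "Condition X"),
    ("source B: condition X, specific subtype 2", "Condition X"),
    ("source C: condition X", "Condition X"),
    ("source D: condition Y", "Condition Y"),
]


def sources_by_target(pairs):
    """Group source codes under the standard concept they map to."""
    grouped = defaultdict(list)
    for source, target in pairs:
        grouped[target].append(source)
    return dict(grouped)


def review_candidates(pairs, min_sources=2):
    """Targets that absorb several source codes; candidates for manual review."""
    return {
        target: sources
        for target, sources in sources_by_target(pairs).items()
        if len(sources) >= min_sources
    }


candidates = review_candidates(maps_to)
for target, sources in candidates.items():
    print(target, "<-", sources)
```

A high count does not prove a mapping is too coarse; it only narrows down where a person (or an LLM) should look.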

Solution ideas: I like the idea of using LLMs to evaluate existing mappings and propose corrections and/or new extension concepts as applicable. The most recent GenAI Workgroup meeting discussed this use case:

I will be happy to join in person at the Symposium (and hopefully @Ajit_Londhe @mdlavallee92 and others from our team can too!).

Dymshyts · September 5, 2025, 10:37pm

It will be super useful:) Thanks @katy-sadowski !

Dymshyts · September 16, 2025, 11:54am

Another use case:
We need to define leukopenia, and
D72.819 Decreased white blood cell count, unspecified fits well, as according to ICD10CM it’s Applicable To:

Decreased leukocytes, unspecified
Leukocytopenia, unspecified
Leukopenia

While D72.818 Other decreased white blood cell count doesn’t fit leukopenia, as it’s Applicable To decreases in specific cell types:

Basophilic leukopenia
Eosinophilic leukopenia
Monocytopenia
Other decreased leukocytes
Plasmacytopenia

In OMOP we have always mapped “other condition X” and “unspecified condition X” to the same “condition X” concept, leukopenia in this case.
So currently we need to use source codes to distinguish between these two conditions.
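A small Python sketch of what that workaround looks like on toy records. The field names loosely follow the CDM condition table, but the records are invented and the standard concept is represented by its name only; no concept_id is assumed.

```python
LEUKOPENIA = "Leukopenia"  # stands in for the standard concept

# Invented records: both ICD10CM codes land on the same standard concept.
records = [
    {"person_id": 1, "condition_concept": LEUKOPENIA, "condition_source_value": "D72.819"},
    {"person_id": 2, "condition_concept": LEUKOPENIA, "condition_source_value": "D72.818"},
    {"person_id": 3, "condition_concept": LEUKOPENIA, "condition_source_value": "D72.819"},
]


def by_standard(records, concept):
    """Selection on the standard concept cannot separate the two codes."""
    return [r["person_id"] for r in records if r["condition_concept"] == concept]


def by_source(records, source_codes):
    """Selection on source codes keeps only the wanted ICD10CM code(s)."""
    return [
        r["person_id"]
        for r in records
        if r["condition_source_value"] in source_codes
    ]


all_persons = by_standard(records, LEUKOPENIA)
leukopenia_only = by_source(records, {"D72.819"})
print(all_persons, leukopenia_only)
```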

Dymshyts · September 26, 2025, 1:44pm

And a couple of other potential use cases:
https://athena.ohdsi.org/search-terms/terms/725460
Measurement of Transaminases is mapped to Measurement of liver enzymes in general (this doesn’t cause a practical problem, as it’s the only source code mapped to ‘Measurement of liver enzymes’).

All these codes are mapped to one concept, 372629 Nonexudative age-related macular degeneration, while in practice we need to distinguish between them (for example, we are interested in Geographic atrophy, which is an advanced stage coded as H35.31_(3|4), not in all these codes).
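The pattern above can be written as a regular expression for filtering source codes. In this Python sketch the `_` is read as a single-character wildcard for the laterality digit, which is an assumption about the notation; the sample code strings are only there to exercise the pattern.

```python
import re

# Assumption: '_' in H35.31_(3|4) stands for exactly one character
# (read here as a digit), followed by a final 3 or 4.
GEOGRAPHIC_ATROPHY = re.compile(r"^H35\.31[0-9][34]$")

sample_codes = [
    "H35.3110",
    "H35.3113",
    "H35.3124",
    "H35.3132",
    "H35.3194",
    "H35.32",
]

matches = [code for code in sample_codes if GEOGRAPHIC_ATROPHY.match(code)]
print(matches)
```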