Vocabulathon 2025: chipping away at the Bummock

Hi Friends!

As many of you know, vocabulary work in OHDSI has always been one of the toughest, but most essential, pillars of our ecosystem. Over the years, we’ve accumulated a long backlog of tricky, deep-rooted issues. Some of these challenges are visualized in what we call the bummock of the iceberg (see image below), and unfortunately, this part of the iceberg is mostly frozen in 2025.

But that doesn’t mean we should wait.

Since centralized vocabulary development will likely remain maintenance-focused through 2025, the only way to move forward is for us, the community, to come together and take initiative. And what better way than through something we all enjoy in OHDSI – the athons!

We’ve done many focused athons in OHDSI: Studyathons, Phenotypeathons, Devathons, and so on. But one thing we’ve never done, despite frequent promises, is a Vocabularyathon. It’s long overdue.

What’s the Format?

The proposal here is to host the first-ever Vocabularyathon during the 2025 Global OHDSI Symposium in the US.

Here’s how it could work:

  • Participants will join teams based on vocabulary topics they care about.
  • Each team will be facilitated by vocabulary experts, who’ll help frame ideas in line with OHDSI conventions and vocabulary principles.
  • Ahead of the Symposium, teams will prepare an outline of their problem space.
  • During the Symposium, we’ll use in-person time for brainstorming design ideas and potential solutions.
  • We don’t expect the problems to be solved on-site. Many will require substantial follow-up effort, but we aim to produce shared understanding and direction to start up long-term collaboration within these groups.

Importantly, this isn’t a place for local or disease-specific issues. We’re focusing on the model, design, and vocabulary-wide issues. Think structural improvements, not national or project-specific gaps or one-off mappings.

What’s on the Table?

Here’s a list of core topics from our backlog, along with contributors who’ve already raised or worked on these issues. If you see yourself here or want to add something new, speak up below or join us during the Vocabulary WG session at the European OHDSI Symposium this Sunday!

1. Races/Ethnicities

@Jake, @MPhilofsky and @piper proposed combining Race and Ethnicity domains into a unified one, using broader CDC categories and multiple source vocabularies to reflect nationality and mixed populations. The challenge is whether we actually try to create the shared definitions in OMOP and de-dup/map concepts based on them.

2. Procedure modifiers

@piper, @DTorok, and Themis WG discussed cleaning up CPT/HCPCS modifiers, potentially removing the modifier_concept_id field and restructuring modifiers by domain (route, laterality, etc.).

3. CPT4/HCPCS decomposition

VA contributors, @Dymshyts and @MPhilofsky debated how composite visit codes lost semantic richness after mapping to generic Visit and other domain concepts. Ideas emerged to preserve detailed Observations alongside them to ensure domain consistency, vocabulary principles and global (ex-US) adoption.

4. Missing & reused NDCs

Many folks flagged frequent gaps in NDC coverage and ambiguity from reused codes. Solutions may include better use of date logic, missing and unmapped code collection, with further contribution to Vocabularies.
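As a rough illustration of the "date logic" idea, here is a minimal Python sketch. It assumes each NDC code can have several validity periods and that the exposure date picks the right one. The `NdcRecord` structure, the codes, and the drug names are all made up for illustration; they are not taken from the OHDSI vocabularies.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class NdcRecord:
    """One validity period for an NDC code (hypothetical structure)."""
    ndc: str
    drug_name: str
    valid_start: date
    valid_end: date


def resolve_ndc(records: list[NdcRecord], ndc: str, on: date) -> Optional[NdcRecord]:
    """Return the record whose validity window contains `on`, or None if none does."""
    matches = [r for r in records if r.ndc == ndc and r.valid_start <= on <= r.valid_end]
    # If windows overlap, prefer the one that started most recently.
    return max(matches, key=lambda r: r.valid_start, default=None)


# Illustrative data only: one made-up code reused for two different products.
RECORDS = [
    NdcRecord("00000000001", "Drug A 10 MG Oral Tablet", date(2005, 1, 1), date(2012, 12, 31)),
    NdcRecord("00000000001", "Drug B 5 MG Oral Tablet", date(2015, 1, 1), date(2099, 12, 31)),
]

# Codes with no valid record on the exposure date are collected for later review.
unmapped: list[tuple[str, date]] = []
for code, exposure_date in [
    ("00000000001", date(2010, 6, 1)),
    ("00000000001", date(2020, 6, 1)),
    ("00000000001", date(2014, 6, 1)),
]:
    if resolve_ndc(RECORDS, code, exposure_date) is None:
        unmapped.append((code, exposure_date))
```

The `unmapped` list stands in for the "missing and unmapped code collection" step: whatever falls through the date windows is what a site could contribute back.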

5. Devices

@AsiyahFDA, @mmatheny, and the Device WG are pushing to support global device vocabularies, not just FDA UDIs, and to represent manufacturer/model attributes flexibly.

6. Rare diseases (Orphanet, Monarch)

The Orphanet team (Ana Rath, Marc Hanauer) and the German friends (@Michele_Zoch, @mik) led the mapping of ORDO, seeking formal vocabulary inclusion and robust mappings to SNOMED/ICD. The Monarch initiative (@mellybelly and @Bryan_Laraway) demonstrated how Mondo and the Human Phenotype Ontology (HPO) can be used to algorithmically identify rare disease cohorts in OMOP data, as shown in N3C. The new Rare Diseases WG, co-led by Xiaoyan Wang and @chunhua, is aiming to advance the Rare Disease research in OMOP which eventually relates to the vocabulary problem.

7. Complex mapping constructs

@Christian_Reich, Oncology WG, and the vocabulary team explored wide mapping tables and relationship groups for decomposing pre-coordinated concepts into multi-column outputs – ideal for surveys, labs, registries, etc.

8. Surveys & EAV-type data

@FrogGirl, the Survey WG, and the vocabulary team explored how to represent survey/questionnaire data without over-standardizing instrument-specific logic. Ideas include de-standardized storage with mapping to pre-coordinated concepts in respective domains.

9. RxNorm Extension & Drug domain quality

@aostropolets, @Eduard_Korchmar, and European teams (@mdewilde, @toms, @freija) have been looking to advance RxE coverage by integrating non-US drug sources, resolving overlaps with RxNorm and improving the Drug domain structure and approach.

10. Units to Measurements

@Vojtech_Huser and @MPhilofsky proposed standard units per measurement concept, and linking via relationships to support quality checks and unit conversions. @Ahmed-Medhat-Zayed flagged 100+ UCUM units with syntax issues, proposing corrections to improve machine-readability and support automated ETL workflows.
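To make the proposal a bit more concrete, here is a minimal Python sketch of what a "standard unit per measurement" lookup with conversions could look like. The measurement keys and the two dictionaries are hypothetical; in the vocabularies this would presumably be expressed through concept relationships. Unit strings follow UCUM.

```python
# Hypothetical lookup: measurement -> preferred (standard) UCUM unit.
STANDARD_UNIT = {
    "body weight": "kg",
    "body height": "cm",
    "body temperature": "Cel",
}

# Conversions into the standard unit, keyed by (from_unit, to_unit).
CONVERSIONS = {
    ("[lb_av]", "kg"): lambda v: v * 0.45359237,
    ("g", "kg"): lambda v: v / 1000.0,
    ("[in_i]", "cm"): lambda v: v * 2.54,
    ("m", "cm"): lambda v: v * 100.0,
    ("[degF]", "Cel"): lambda v: (v - 32.0) * 5.0 / 9.0,
}


def to_standard(measurement, value, unit):
    """Return (value, unit) in the standard unit; raise ValueError if not convertible."""
    target = STANDARD_UNIT[measurement]
    if unit == target:
        return value, target
    try:
        convert = CONVERSIONS[(unit, target)]
    except KeyError:
        # An unknown pairing doubles as a quality check: the unit is implausible here.
        raise ValueError(f"No conversion from {unit!r} to {target!r} for {measurement}") from None
    return convert(value), target


height_cm, height_unit = to_standard("body height", 70, "[in_i]")
```

A unit with no registered conversion (say, `mm[Hg]` on a body weight) raises an error, which is the quality-check side of the same table.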

11. Vaccine vocabulary

Oliver He, Jie Zheng, and the Vaccine WG worked to build a hierarchical vaccine ontology (VO) in OMOP, aimed at classifying the existing standard vaccine concepts.

12. OpenMRS

@Andy_Kanter, @grace_potma, and collaborators from the African Chapter initiated mapping of the OpenMRS model and dictionaries into OMOP CDM to support broader global EHR inclusion.

13. Microbiology & drug susceptibility

@cukarthik and @Christian_Reich have been discussing the better representation of culture results and susceptibility tests – either through new tables or better concept design.

14. That one issue you’ve always wanted to bring up

You know the one. The vocabulary quirk that’s annoyed you for years, the modeling gap you’ve tiptoed around in every ETL, or the debate you’ve had three times already in chats. This is your chance to finally surface it. Propose it below and gather a team. No idea is too unfinished. Let’s refine them together!

Who Can Attend?

The short answer: everyone.
While many of the Vocabularyathon topics do require deep-dive expertise in OHDSI vocabulary structure, many tasks don’t. There’s plenty of work that just needs domain knowledge and willingness to collaborate.

For example:

  • Gathering information about source vocabularies that could be added (new drug, procedure, race/ethnicity standards) doesn’t require technical vocabulary expertise. It’s about systematically cataloguing the right sources and healthcare systems.
  • Compiling and comparing drug dictionaries and their structures from different countries or projects is valuable groundwork for a conceptual approach to extending the Drug domain further.
  • Exploring how ATC classifications are used (or misused) across data partners is another area where hands-on experience with datasets matters.

What Can You Do?

  1. Speak up in this thread.
    If any of the above topics resonate, or you have a different general (not local!) vocabulary issue, add your name and ideas below. We’re shaping the issue/team lists from now on.

  2. Join the Vocabulary WG session this Sunday at the European OHDSI Symposium in Hasselt.
    We’ll discuss this initiative live and start organizing further actions.

If you’ve ever thought, “This vocabulary problem has been sitting unsolved forever…”, then maybe this is our chance. Let’s make progress on the bummock together!

Looking forward,
Alexander and the vocabulary folks

Lots of tags here. Let me go through in numerical order:

#1. There isn’t a global definition for race or ethnicity. In the US, “black” is a race. In the UK, “black” is an ethnicity. Therefore, we need one, combined “race-ethnicity” vocabulary.

There also isn’t a global definition of “black” or any other race/ethnicity. So, we can’t offer the community a definition.

What we can and should offer the community is the ability to use their source race & ethnicity values as standard concept_ids, so researchers and investigators can use them in concept sets and cohorts when they build their studies and disseminate the results. @Piper-Ranallo and I just completed this work and submitted it for inclusion in the August 2025 OHDSI vocabulary release.

De-duplication on exact lexical matches (black = black), along with identifying synonyms (Native Village of Aleknagik = Aleknagik) was completed as part of our submission.
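For readers who want a feel for what this kind of purely lexical de-duplication involves, here is a minimal Python sketch. It is not the code used in the submission: the source vocabulary names are invented, the normalization rule is an assumption, and the synonym table holds only the one example given above. It deliberately makes no claim about shared definitions.

```python
from collections import defaultdict


def normalize(name: str) -> str:
    """Case-fold and collapse whitespace so exact lexical matches compare equal."""
    return " ".join(name.casefold().split())


# Curated synonym pairs (variant -> preferred form), keyed by normalized text.
SYNONYMS = {normalize("Native Village of Aleknagik"): normalize("Aleknagik")}


def deduplicate(source_terms):
    """Group (vocabulary, term) pairs under one key per distinct normalized term."""
    groups = defaultdict(list)
    for vocabulary, term in source_terms:
        key = normalize(term)
        key = SYNONYMS.get(key, key)  # fold known synonyms into the preferred form
        groups[key].append((vocabulary, term))
    return dict(groups)


# Hypothetical source vocabularies and values.
terms = [
    ("source_a", "Black"),
    ("source_b", "black"),
    ("source_a", "Native Village of Aleknagik"),
    ("source_b", "Aleknagik"),
]
groups = deduplicate(terms)
```

Each key in `groups` would correspond to one candidate concept, with the original source values kept alongside it.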

#2. The vocabulary team already restructured the modifiers into domains. This seems like more of an ETL/CDM WG/Themis topic. What use case are we trying to solve? I’m only aware of the one-Procedure-with-many-modifiers issue.

#3. I thought the CPT4 changes were reversed? What more needs to be done?

#10. My particular interest is with vital signs. I have conversion algorithms to international standards for the most common vital signs and their units of measure.

#13. @QI_omop, @schillil and others have a sub-group in Themis discussing use cases. Let’s invite them.

#14. Let’s strongly encourage new collaborators to not use the STCM. Not only is it the black box of mapping, it also adds more code to the ETL. This doesn’t need discussion, this needs documentation and a clear community voice.

I have to send my regrets for any WG activities at the global symposium. I have to leave early to talk about #1 at another symposium. I’m happy to assist in the prep work and provide any materials, information or opinions on the topics above.

@Alexdavv thank you for your proposal.

Probably what bothers me the most is the complexity of condition mapping:
mapping to multiple concepts and “uphill” mapping often make concept set creation quite a puzzle, so I sometimes even need to use non-standard codes to define the condition of interest.

This problem also prevents some organizations from using OHDSI vocabularies, I think.

And the proposed format will work well:
we’ll bring our models and methods, and then we can compare them and decide which one is better.
Something like: we make ICD10CM another partially standard vocabulary. If there’s no mapping to SNOMED, we make the ICD10CM concept standard; other ICD condition vocabularies will be mapped to SNOMED or ICD10CM.
Evaluation of the mappings will be made using LLMs.
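One possible reading of the "partially standard ICD10CM" idea, as a minimal Python sketch. Everything here is an assumption for illustration: the function names, the placeholder codes, and in particular the premise that the SNOMED map contains only mappings judged precise enough to keep.

```python
from typing import Optional


def resolve_standard(icd10cm_code: str, icd10cm_to_snomed: dict[str, str]) -> tuple[str, str]:
    """Return the (vocabulary, code) that would act as the standard concept."""
    target = icd10cm_to_snomed.get(icd10cm_code)
    if target is not None:
        return ("SNOMED", target)
    # No SNOMED mapping: the ICD10CM concept itself becomes standard.
    return ("ICD10CM", icd10cm_code)


def resolve_other_icd(
    code: str,
    to_snomed: dict[str, str],
    to_icd10cm: dict[str, str],
    icd10cm_to_snomed: dict[str, str],
) -> Optional[tuple[str, str]]:
    """Map a code from another ICD vocabulary to SNOMED first, else via ICD10CM."""
    if code in to_snomed:
        return ("SNOMED", to_snomed[code])
    if code in to_icd10cm:
        return resolve_standard(to_icd10cm[code], icd10cm_to_snomed)
    return None  # unmapped; left for manual review


# Placeholder codes only, not real ICD or SNOMED identifiers.
ICD10CM_TO_SNOMED = {"ICD10CM-A": "SNOMED-1"}
OTHER_TO_SNOMED = {"ICD10-X": "SNOMED-2"}
OTHER_TO_ICD10CM = {"ICD10-Y": "ICD10CM-B", "ICD10-Z": "ICD10CM-A"}

result = resolve_other_icd("ICD10-Y", OTHER_TO_SNOMED, OTHER_TO_ICD10CM, ICD10CM_TO_SNOMED)
```

Here `ICD10-Y` ends up on a standard ICD10CM concept because its ICD10CM equivalent has no SNOMED mapping, which is the case the proposal is meant to cover.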

Great idea @Alexdavv !

@Dymshyts I love this idea and would be happy to pitch in. I have many examples of problematic concept sets of this sort. Starting from ICD10-CM codelists (not by choice… :upside_down_face:) often reveals incompatibilities between SNOMED and ICD and other issues like uphill or questionable source-to-standard mappings. It would be great to characterize these issues and propose solutions that help improve the capture of ICD-based data in OHDSI.

Lovely post @Alexdavv , thanks for your leadership in bringing this Vocabularyathon to life. I want to reaffirm your invitation to everyone: while not everyone may have the expertise to develop our full vocabulary system, all of us are vocabulary users and have all experienced challenges in our use. This is a great opportunity to surface those challenges and collaborate toward the design and implementation of solutions.

I saw several exciting collaborator showcase submissions at OHDSI Europe and OHDSI Global that are centered on the use of LLMs to improve mappings, but a key reflection for me is that we have too many people independently trying to solve the same problem, and not enough collaboration across groups in design, implementation, and evaluation once these shared problems are identified. I hope this event can help improve our community in this regard.

There is a group of us working on an immune ontology that would be relevant to the Vocabularyathon. I am not sure how much progress we’ll make by that time, but anyone interested in autoimmune disease definitions, naming, and organization should contact me so I can connect you to our group (the group’s domain is actually all of immunology, not just autoimmune diseases). The group is mostly academics and mostly centered at U Buffalo, but we believe that collaboration with the OHDSI community will be essential. If anyone has questions about what we are doing, please email me at aaron@autoimmuneregistry.org.

Thanks @Alexdavv - I would like to participate in group #7 to collaborate - and help break down pre-coordinated concepts and relationship groups.

Making a pun on “vocabulary” - how about vocabul-a-thon? :slight_smile: (dropping the “ry” and combining the two a’s)

Hi @MPhilofsky and thank you for your input!

Let me promote the related forum post and encourage people to review and speak out!

The only remaining concern, raised by @Christian_Reich, is that avoiding definitions should in principle mean no deduplication, no mappings, and no synonyms, even for lexical matches, because the same term can mean different things in different contexts. Yet the submission does seem to include quite a few synonyms, mappings, and deduplication steps. I’d let Christian continue the discussion in the mentioned forum post, and maybe in the format of the Vocabularyathon :smiley:.

Right. We discard them from the clean Domains (yeah, relatively clean Domains :blush:). The next step could be a classification and clean-up so that we can actually use them in studies. Here’s the problem statement and a work outline.

The decomposition and clean-up work is described here. Somebody said these concepts carry meaningful information :slightly_smiling_face:. Should we hide it from users in poorly modeled, over-coordinated local concepts? Or should we make these concepts available to the global community, properly coordinated and classified by the Domains?

Great! That would be a good start - we can’t cover the entire space.
@Vojtech_Huser Would you be interested in taking it on if we call the entire event Measureathon :sweat_smile:? BTW, I love your proposal.

@QI_omop, @schillil Would you like to be invited?

This is closely connected to #7 since this is the replacement for the old STCM. The proposal is here, and it requires elaboration.

@clairblacketer perfectly outlined the recommendations here. But wouldn’t the “Outdated” category and an explicit ban (recommendation) on using this table in new implementations work best?

Hmm, we’re not a Working Group. Not a Tutorial either. If we manage to attract enough attention and participation, should we consider having a separate day for it :blush:?

Hi @Dymshyts @katy-sadowski, and welcome!

Would you like to invite @m-khitrun, @Eduard_Korchmar, @Patrick_Ryan, and the Vocabulary Committee to assess your new ideas and our previous options from the perspective of implementability and supportability?

Thanks, @Patrick_Ryan, for your support!
It reminds me of another topic: the Vocabulary management/exchange systems and approaches developed independently by (i) the Vocabulary team; (ii) @Andrew, @Jared, and @Polina_Talapova; and (iii) @Javier, @gutte, and other FinOMOP people. @Andy_Kanter, @jamlung, Jon Payne, and the OCL friends also have plans to adopt the OCL tooling for the CIEL vocabulary refresh and the general source delivery system.

Hi @aaronabend! That would be great! I’ll reach out to meet with the vocabulary leadership.

@llach Wonderful! In what context do you struggle with the problem?

@Vojtech_Huser Indeed, we require plural names. Then we decide which one is the concept name and which are synonyms.
Vocab-a-thon?

Regarding the non-same-as standard concept discussion… we certainly have had a lot of back-and-forth about this issue for years. As a purveyor of interface terminology, I have had this high on my list for a decade. We have the IMO example, where most OHDSI users in the US have an interface terminology code in their source systems already mapped to standards… but these are either not used to properly ETL the data into OMOP using the standards, or the interface code is not persisted in the CDM in a way that can be used for cohort development (at least not in Atlas). For CIEL, we have been thinking of proposing to promote to standard those CIEL concepts that are mapped narrower-than to a standard or that require multiple standard codes. Concept-concept relationships should provide the map to existing standards when needed, but also allow for cohort definition using pre-coordinated concepts that are in use in source systems.

So, basically, we have 4 hours, from 8:00 am to 12:00 pm on the 7th of October, the first day of the Symposium.
We’ll split into groups, and each group will try to get closer to resolving its problem.
I’ll be leading the Precise mapping group, where we’ll tackle the problem of losing information when mapping (including mapping to multiple concepts, which technically preserves the information but makes the analysis harder).

@Dymshyts I’m so excited and can’t wait any longer, either!

Hi all!

A quick but important update: the Vocabularyathon is now officially confirmed as part of the 2025 Global OHDSI Symposium agenda! We’ll meet in person on Tuesday, October 7th, from 8:00 a.m. to 12:00 p.m. for a focused, collaborative session on long-standing vocabulary challenges.

How to sign up?

You can now register for the Vocabularyathon during your Symposium registration.

If you already registered before this option was added, you can:

  • Log back into your Eventbrite registration and update your choices, or
  • Email symposium@ohdsi.org and ask the team to add you to the Vocabularyathon.

Confirmed Topics & Leads

We’re thrilled to already have the following teams confirmed:

  • Surveys & EAV-type data – led by Nicole Gerlanc
  • Vaccine vocabulary – led by Oliver He
  • Races/Ethnicities – led by Melanie Philofsky & Piper Ranallo
  • Visit definitions – led by Melanie Philofsky & Piper Ranallo
  • Non-same-as standard concept mapping – led by Dmitry Dymshyts
  • Procedure modifiers – led by Oleg Zhuk
  • Oncology-related topics – led by Asieh Golozar
  • Microbiology & drug susceptibility – led by Jared Houghtaling
  • Rare diseases (Orphanet, Monarch) – led by Bryan Laraway

Topics in Discussion

We’re also in active discussion on forming teams for:

  • Missing & reused NDCs

Topics Still in the Bummock

There are many other ideas floating - some with a rough outline but still waiting for someone to take them on. One example is the Family history problem summarized by @m-khitrun here. We’ve sketched out potential directions, but it needs a lead.

If you’ve got a topic that belongs in the Vocabularyathon or want to help lead one of the above, now’s the time to raise your hand.

The Vaccine vocabulary topic kick-off is scheduled for this Friday, August 29th at 10 a.m. ET. Everybody interested, please come to the Vaccine WG meeting in Teams or use this link.

Everybody is welcome to review the call for a Precise Mapping group here.

Here are a couple of updates on the OHDSI Vocabulathon 2025:

1. The Oncology-related topic, led by Asieh Golozar, is confirmed and will be part of the Vocabulathon.

2. The Microbiology & drug susceptibility topic, led by Jared Houghtaling, is confirmed and will be part of the Vocabulathon.

3. The Rare diseases topic, led by Bryan Laraway, is confirmed and will be part of the Vocabulathon.

4. The next two meetings of the CDM Survey Sub-group will be dedicated to preparing for the upcoming Vocabularyathon (Surveys & EAV-type data topic). The next meetings are on Sept 16 (10:00 AM ET) and Sept 30 (10:00 AM ET). Everybody interested, please join using this link.
Please also review the plan for the Vocabulathon and provide feedback in this short questionnaire, https://forms.gle/7k8fYuN9FQ3bEwsw5, by Friday, September 12, EOD so we have discussion points for the upcoming meetings.

5. The Vaccine vocabulary topic continues its sessions at the Vaccine vocabulary workgroup. The upcoming sessions are on Fridays, Sep 12th, 19th, and 26th, at 12 PM ET. Everybody interested, please join using this link.

:round_pushpin:Interested in joining the Vocabularyathon (Oct 7, 8:00–12:00)?
You can sign up when registering for the Symposium, or afterwards in your Eventbrite account, or email symposium@ohdsi.org to update your registration.

Thanks, @Alexdavv!

I started a separate thread focused specifically on oncology vocabularies. If you’re working in this space, please join us: Vocabulathon 2025: What’s Next for Oncology Vocabularies?

@Alexdavv

Do you have an agenda for the Vocabulathon? It’s time to start planning, and I need to know how much time we have to discuss our topics.

Hi everyone,

It’s the first time we’re organizing the Vocabulathon. We’ve been thinking about how to best structure it and make the most of our esteemed vocabulary experts, since only a limited number of them will be attending the conference in person.
Here’s what we propose:

We have around 70 people registered so far and a total of 9 topics. All groups will work in the same room, divided by topic. We’ll begin with a general introduction and presentations from the topic leads. The focused work on each topic will then take place simultaneously at separate tables assigned to each group. Participants who want to contribute to more than one topic will need to move between tables and coordinate with the respective topic leads.

The Races/Ethnicities topic will be handled a bit differently: instead of a brainstorming session, it will feature a more detailed presentation for the general audience. Questions and feedback from users can be shared during the break or networking time.

Agenda and timing:

8.00-8.20
Opening and Races/Ethnicities success story by Alexander Davydov and Melanie Philofsky.
Showcasing our prior experience working in a Vocabulathon-like format with community contributors and vocabulary experts.

8.20-8.50
Introductory presentations from topic leads (≈3 minutes each).
Brief outlines of the problem space, current or planned solutions, and goals for the Vocabulathon in-person event.

8.50-9.50
Brainstorming – Part I

9.50-10.00
Break (individual groups may adjust their schedules).

10.00-11.00
Brainstorming – Part II

11.00-12.00
Closing presentations from topic leads (≈7 minutes each) and organizers.
Highlighting progress achieved, goals defined, and implementation plans.

@MPhilofsky and others, please share your thoughts.

I like the division of the agenda. I will co-ordinate with you, @Alexdavv, on the Opening.

For those who are interested in discussing the use cases for the Visit Occurrence table, the definition of a “Visit” in OMOP, and the ambiguity in the current standard concept_ids for this domain, please join me at the OHDSI Vocabul-a-thon.

My concern with standard visit concepts is that they not only identify a visit (IP, OP, ER, etc.), but some also carry Care Site and/or clinical event information. And some standard visit concepts are children of both an inpatient and an outpatient visit. It’s quite confusing for our analysts and researchers trying to create cohorts!
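To show how the "children of both" concepts could be listed, here is a minimal Python sketch over (ancestor_concept_id, descendant_concept_id) pairs shaped like rows of the concept_ancestor table. The rows and the descendant IDs are made up; only 9201 (Inpatient Visit) and 9202 (Outpatient Visit) are meant as the real standard visit concepts.

```python
INPATIENT, OUTPATIENT = 9201, 9202  # OMOP "Inpatient Visit" and "Outpatient Visit"


def ambiguous_visit_concepts(concept_ancestor):
    """Return descendant IDs that roll up to both the inpatient and the outpatient visit."""
    ancestors = {}
    for ancestor_id, descendant_id in concept_ancestor:
        ancestors.setdefault(descendant_id, set()).add(ancestor_id)
    return sorted(
        descendant_id
        for descendant_id, ancestor_ids in ancestors.items()
        if {INPATIENT, OUTPATIENT} <= ancestor_ids
    )


# Toy rows: descendant IDs 1000001 and 1000002 are invented for the example.
rows = [
    (9201, 9201),     # self-referencing rows, as concept_ancestor contains them
    (9202, 9202),
    (9201, 1000001),  # rolls up to inpatient...
    (9202, 1000001),  # ...and to outpatient
    (9201, 1000002),  # inpatient only
]
ambiguous = ambiguous_visit_concepts(rows)
```

Run against a real concept_ancestor extract, the resulting list would give the group a concrete starting point for the disambiguation discussion.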

Things we need to do:

  1. Bring your use case! Historically, we used standard visit concepts as a proxy for severity. I’m only trusting a diagnosis of myocardial infarction if the clinical record is associated with an IP or ED visit.
  2. Let’s define “visit”. What does that mean from the patient perspective? For the most part, I thought we answered that question really well in our CDM v5.4 specifications. IMO, we need to get rid of “pharmacy, lab and ambulance” visits since those are care sites. And if the community decides to keep the combo Visit/Care Site, then this list needs to be greatly expanded. But if we do that, then do we get rid of the Care Site table? Researchers use the Care Site table to identify persons in the ICU when linked to a Visit Detail record. As always, there are pros and cons to every decision.

I think we can accomplish #1 and #2 at the Symposium. If you can’t attend, please post here so we don’t clutter this thread. After we complete the first two steps, we can come up with some proposals (if we decide updates need to be done), create an issue with the Themis, CDM and/or Vocabulary WGs, and disambiguate the standard visit concepts.

Glad to hear of this initiative! (Unfortunately I won’t be at the symposium). I have another topic worth considering.

Similar to #10 (units to measurements), there can be a benefit to linking Routes to Drugs based on the medication’s form. By no means is it critical, but odd combinations of drug forms and routes can be useful to flag for QA purposes. Alternatively, this can also be a part of DQD’s plausibility checks.
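A minimal Python sketch of what such a form-versus-route QA flag might look like. The `EXPECTED_ROUTES` table, the field names, and the sample records are all hypothetical; a real check would draw forms and routes from the vocabularies and would need clinical review of what counts as odd.

```python
# Hypothetical plausibility table: dose form -> routes considered expected.
EXPECTED_ROUTES = {
    "oral tablet": {"oral"},
    "injectable solution": {"intravenous", "intramuscular", "subcutaneous"},
    "topical cream": {"topical"},
    "ophthalmic solution": {"ophthalmic"},
}


def flag_odd_routes(exposures):
    """Return the exposures whose route is not expected for the drug's dose form."""
    flagged = []
    for exposure in exposures:
        expected = EXPECTED_ROUTES.get(exposure["dose_form"])
        # Forms missing from the table are skipped rather than flagged.
        if expected is not None and exposure["route"] not in expected:
            flagged.append(exposure)
    return flagged


# Made-up records for illustration.
exposures = [
    {"id": 1, "dose_form": "oral tablet", "route": "oral"},
    {"id": 2, "dose_form": "oral tablet", "route": "intravenous"},
    {"id": 3, "dose_form": "nasal spray", "route": "oral"},
]
odd = flag_odd_routes(exposures)
```

The output is a review list rather than an error list, which fits the "useful to flag for QA" framing above.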