Hi Friends!
As many of you know, vocabulary work in OHDSI has always been one of the toughest, but most essential, pillars of our ecosystem. Over the years, we’ve accumulated a long backlog of tricky, deep-rooted issues. Some of these challenges are visualized in what we call the bummock of the iceberg (see image below), and unfortunately, this part of the iceberg is mostly frozen in 2025.
But that doesn’t mean we should wait.
Since centralized vocabulary development will likely remain maintenance-focused through 2025, the only way to move forward is for us, the community, to come together and take initiative. And what better way than through something we all enjoy in OHDSI – the athons!
We’ve done many focused athons in OHDSI: Studyathons, Phenotypeathons, Devathons, and so on. But one thing we’ve never done, despite frequent promises, is a Vocabularyathon. It’s long overdue.
What’s the Format?
The proposal here is to host the first-ever Vocabularyathon during the 2025 Global OHDSI Symposium in the US.
Here’s how it could work:
- Participants will join teams based on vocabulary topics they care about.
- Each team will be facilitated by vocabulary experts, who’ll help frame ideas in line with OHDSI conventions and vocabulary principles.
- Ahead of the Symposium, teams will prepare an outline of their problem space.
- During the Symposium, we’ll use in-person time for brainstorming design ideas and potential solutions.
- We don’t expect the problems to be solved on-site. Many will require substantial follow-up effort, but we aim to produce shared understanding and direction to start up long-term collaboration within these groups.
Importantly, this isn’t a place for local or disease-specific issues. We’re focusing on the model, design, and vocabulary-wide issues. Think structural improvements, not national or project-specific gaps or one-off mappings.
What’s on the Table?
Here’s a list of core topics from our backlog, along with contributors who’ve already raised or worked on these issues. If you see yourself here or want to add something new, speak up below or join us during the Vocabulary WG session at the European OHDSI Symposium this Sunday!
1. Races/Ethnicities
@MPhilofsky and @piper proposed combining Race and Ethnicity domains into a unified one, using broader CDC categories and multiple source vocabularies to reflect nationality and mixed populations. The challenge is whether we actually try to create the shared definitions in OMOP and de-dup/map concepts based on them.
2. Procedure modifiers
@piper, @DTorok, and Themis WG discussed cleaning up CPT/HCPCS modifiers, potentially removing the modifier_concept_id field and restructuring modifiers by domain (route, laterality, etc.).
3. CPT4/HCPCS decomposition
VA contributors, @Dymshyts, and @MPhilofsky debated how composite visit codes lose semantic richness when mapped to generic Visit and other domain concepts. Ideas emerged to preserve detailed Observations alongside them, in keeping with domain consistency and vocabulary principles and with global (ex-US) adoption in mind.
4. Missing & reused NDCs
Many folks flagged frequent gaps in NDC coverage and ambiguity from reused codes. Solutions may include better use of date logic and collecting missing and unmapped codes so they can be contributed back to the Vocabularies.
5. Devices
@AsiyahFDA, @mmatheny, and the Device WG are pushing to support global device vocabularies, not just FDA UDIs, and to represent manufacturer/model attributes flexibly.
6. Rare diseases (Orphanet, Monarch)
The Orphanet team (Ana Rath, Marc Hanauer) and the German friends (@Michele_Zoch, @mik) led the mapping of ORDO, seeking formal vocabulary inclusion and robust mappings to SNOMED/ICD. The Monarch initiative (@mellybelly and @Bryan_Laraway) demonstrated how Mondo and the Human Phenotype Ontology (HPO) can be used to algorithmically identify rare disease cohorts in OMOP data, as shown in N3C. The new Rare Diseases WG, co-led by Xiaoyan Wang and @chunhua, aims to advance rare disease research in OMOP, which ultimately ties back to the vocabulary problem.
7. Complex mapping constructs
@Christian_Reich, Oncology WG, and the vocabulary team explored wide mapping tables and relationship groups for decomposing pre-coordinated concepts into multi-column outputs – ideal for surveys, labs, registries, etc.
8. Surveys & EAV-type data
@FrogGirl, the Survey WG, and the vocabulary team explored how to represent survey/questionnaire data without over-standardizing instrument-specific logic. Ideas include de-standardized storage with mapping to pre-coordinated concepts in respective domains.
9. RxNorm Extension & Drug domain quality
@aostropolets, @Eduard_Korchmar, and European teams (@mdewilde, Tom Seinen, @freija) have been looking to advance RxE coverage by integrating non-US drug sources, resolving overlaps with RxNorm and improving the Drug domain structure and approach.
10. Units to Measurements
@Vojtech_Huser and @MPhilofsky proposed standard units per measurement concept, and linking via relationships to support quality checks and unit conversions. @Ahmed-Medhat-Zayed flagged 100+ UCUM units with syntax issues, proposing corrections to improve machine-readability and support automated ETL workflows.
11. Vaccine vocabulary
Oliver He, Jie Zheng, and the Vaccine WG worked to build a hierarchical vaccine ontology (VO) in OMOP, aimed at classifying the existing standard vaccine concepts.
12. OpenMRS
@Andy_Kanter, @grace_potma, and collaborators from the African Chapter initiated mapping of the OpenMRS model and dictionaries into OMOP CDM to support broader global EHR inclusion.
13. Microbiology & drug susceptibility
@cukarthik and @Christian_Reich have been discussing better representation of culture results and susceptibility tests – either through new tables or better concept design.
14. That one issue you’ve always wanted to bring up
You know the one. The vocabulary quirk that’s annoyed you for years, the modeling gap you’ve tiptoed around in every ETL, or the debate you’ve had three times already in chats. This is your chance to finally surface it. Propose it below and gather a team. No idea is too unfinished. Let’s refine them together!
Who Can Attend?
The short answer: everyone.
While many of the Vocabularyathon topics do require deep-dive expertise in OHDSI vocabulary structure, many tasks don’t. There’s plenty of work that just needs domain knowledge and willingness to collaborate.
For example:
- Gathering information about source vocabularies that could be added (new drug, procedure, race/ethnicity standards) doesn’t require technical vocabulary expertise. It’s about cataloguing the right sources and healthcare systems.
- Compiling and comparing drug dictionaries and their structures from different countries or projects is valuable groundwork for the conceptual approach to further extending the Drug domain.
- Exploring how ATC classifications are used (or misused) across data partners is another area where hands-on experience with datasets matters.
What Can You Do?
- Speak up in this thread. If any of the above topics resonate, or you have a different general (not local!) vocabulary issue, add your name and ideas below. We’re shaping the issue and team lists from now on.
- Join the Vocabulary WG session this Sunday at the European OHDSI Symposium in Hasselt. We’ll discuss this initiative live and start organizing further actions.
If you’ve ever thought, “This vocabulary problem has been sitting unsolved forever…”, then maybe this is our chance. Let’s make progress on the bummock together!
Looking forward,
Alexander and the vocabulary folks