TAB Recommendation: OHDSI Specifications

During a recent session of the OHDSI Technical Advisory Board, the topic of formal specifications for OHDSI artifacts was raised. By OHDSI artifacts, we mean concept sets, cohort definitions, and other JSON serializations of artifacts that are created and used within the OHDSI tools.

Currently there isn’t a formal, published specification for the JSON serialization format used for OHDSI concept sets. This is a common issue in the OHDSI ecosystem, where practical implementations have evolved organically without always having comprehensive documentation.

This lack of formal specification can create challenges for:

  • Tool interoperability
  • Independent implementations
  • Data validation
  • Long-term preservation and migration

The TAB would like to propose a formal specification process in which we, as a community, evolve and publish formal specifications for these artifacts.

The TAB site has been updated with a first draft of a specification for concept sets. We are open to all feedback on this specification and overall process and look forward to collaborating on taking these specifications from draft to published status.

Draft Concept Set Specification

Thanks so much for putting this together, Frank! The page is really great! Maybe a good Hackathon activity at the Symposium could be to implement validation against this schema in Capr :slight_smile: cc @mdlavallee92

Thanks Frank for this, I think this is extremely helpful for anybody (or any LLM!) developing tooling for working with the OMOP CDM.

I wonder if it would be worth incorporating some additional fields to provide more context for the concept set:

  • cdmVersion - there are probably not many concept sets that work only for 5.3 or 5.4, but perhaps some exist out there. So it is possibly useful to know which CDM version the concept set was developed on? Alternatively, there could be a field indicating which version it should work on (so that it could be declared if a concept set was only for 5.4, etc.)?
  • vocabularyVersion - at the moment there seems to be no way to know which vocabulary version a concept set was created for. As the concept set will resolve to different concepts depending on the vocabulary version, this seems like an important characteristic to record, since it will help guide considerations of potential re-use etc.
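As a rough illustration of how such fields might be used, here is a minimal Python sketch. The field names `cdmVersion` and `vocabularyVersion` are the ones suggested above; placing them at the top level of the concept set, the helper name `reuse_warnings`, and the version strings are all assumptions made for the example, not part of the draft specification.

```python
def reuse_warnings(concept_set, local_vocabulary_version, local_cdm_version):
    """Collect warnings to review before re-using a concept set locally.

    Assumes (hypothetically) that the concept set carries optional top-level
    'vocabularyVersion' and 'cdmVersion' fields.
    """
    warnings = []

    recorded_vocab = concept_set.get("vocabularyVersion")
    if recorded_vocab is None:
        warnings.append("vocabularyVersion not recorded")
    elif recorded_vocab != local_vocabulary_version:
        warnings.append(
            f"built on vocabulary {recorded_vocab}, local is {local_vocabulary_version}"
        )

    recorded_cdm = concept_set.get("cdmVersion")
    if recorded_cdm is not None and recorded_cdm != local_cdm_version:
        warnings.append(f"built on CDM {recorded_cdm}, local is {local_cdm_version}")

    return warnings


# Placeholder values, for illustration only.
example = {"vocabularyVersion": "vocab-A", "cdmVersion": "5.4", "items": []}
print(reuse_warnings(example, local_vocabulary_version="vocab-B", local_cdm_version="5.4"))
```

A mismatch does not mean the concept set is unusable, only that it may resolve to a different list of concepts and is worth reviewing.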

For the expression structure, I’m wondering if it is necessary to have conceptName, domainId, etc. as required. It seems like the only fields required to resolve the concept set are conceptId, isExcluded, includeDescendants, and includeMapped. So I would favour making the additional fields optional, so that I can get to the smallest possible size when working with extremely large concept sets.
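To make the size argument concrete, here is a small Python sketch. The four required field names are those listed above; the flat layout of each item, the helper name `missing_or_invalid`, and the concept values are assumptions for illustration and may not match the draft specification.

```python
import json

# Field names from the post above; the flat item layout is an assumption.
REQUIRED_FIELDS = {
    "conceptId": int,
    "isExcluded": bool,
    "includeDescendants": bool,
    "includeMapped": bool,
}


def missing_or_invalid(item):
    """Return the required fields that are absent or have the wrong type."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        value = item.get(name)
        wrong_type = not isinstance(value, expected)
        # bool is a subclass of int in Python, so reject True/False as a conceptId
        bool_as_int = expected is int and isinstance(value, bool)
        if name not in item or wrong_type or bool_as_int:
            problems.append(name)
    return sorted(problems)


# Placeholder concept, for illustration only.
minimal_item = {
    "conceptId": 12345,
    "isExcluded": False,
    "includeDescendants": True,
    "includeMapped": False,
}
verbose_item = dict(minimal_item, conceptName="Example concept", domainId="Condition")

print(missing_or_invalid(minimal_item))
print(len(json.dumps(minimal_item)), len(json.dumps(verbose_item)))
```

Both items pass the check, but the minimal one serializes to fewer bytes, a difference that adds up across a very large concept set.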

This is great. I was one of the advocates for pursuing this. Thank you Frank.

Creating specs like these advances RWD informatics, and the OMOP model can provide leadership (be an example) here, as it did in the past.

Note that cohort definition is next on our to-do list (or long-term vision).

I agree with what Ed is suggesting.

I also used a visualizer to better see the definition.

Having the schema as a separate .json file on GitHub would be better for use in scripts. Code could then fetch it directly instead of parsing the .md file to extract the schema.
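A quick Python sketch of the difference, under the assumption that the schema currently sits inside a json code fence in the Markdown page. The function names and the tiny schema text are hypothetical and stand in for the real files.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable


def schema_from_markdown(md_text):
    """Pull the first json code fence out of a Markdown page and parse it."""
    pattern = re.compile(FENCE + r"json\s*\n(.*?)\n" + FENCE, re.DOTALL)
    match = pattern.search(md_text)
    if match is None:
        raise ValueError("no json code fence found in the Markdown text")
    return json.loads(match.group(1))


def schema_from_json(json_text):
    """With a standalone .json file, loading the schema is a single call."""
    return json.loads(json_text)


# Stand-in content, not the actual draft schema.
schema_text = '{"type": "object", "required": ["items"]}'
md_text = (
    "# Concept Set Specification\n\nSome prose.\n\n"
    + FENCE + "json\n" + schema_text + "\n" + FENCE + "\n"
)

print(schema_from_markdown(md_text) == schema_from_json(schema_text))
```

The Markdown route depends on the page layout staying stable (fence language tag, schema being the first json block), while the standalone file has no such coupling.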

Technical Advisory Board (TAB) meeting report from today: Sep 5, 2025

The group reiterated the vision to define “entities” in OHDSI.

The tentative name is “Defining things roadmap or vision”.

Two clear milestones are concept set and phenotype. (Ideally, don’t use the term cohort; cohort is reserved for the instantiation of a phenotype.)

Lee Evans indicated some early results around phenotype: JSON and even YAML formats, and conversion between them (YAML allows comments while JSON does not), and possibly an early version of validation code.

Next on the roadmap (phase 2) would be simple characterization, simple analysis, and possibly study. The relationship to Strategus was discussed. Phase 2 is less clear to all, since Strategus is a moving target and still emerging. The notion is that a study spec is just a collection of smaller pieces, each of which has its own spec schema, but the study construct is hard to get right without starting small.