Why is phenotyping difficult in OHDSI? Blame the concept set expression and vocabulary

Gowtham_Rao · August 30, 2025, 1:21pm

Hello OHDSI community,

Today, I want to discuss a known but perhaps less spoken about problem, inspired by the recent forum post from @Chris_Knoll (Using the 'Mapped' function of concept set expressions).

I am a big fan of the OHDSI Cohort Definition model (i.e., the CIRCE model) and the ATLAS UI for cohort definition, as it provides an elegant framework for defining entry events, inclusion rules, and exit criteria. However, I often find myself frustrated by the limitations of its concept set expression, the structure of the OMOP vocabulary tables (concept_ancestor and concept_relationship), and the sometimes painfully ambiguous distinction between standard and non-standard concepts.

I believe we must improve these foundational elements. We need to reduce the inherent ‘difficulty’ in phenotyping, minimize the need for ‘vocabulary SQL acrobatics’, and escape the recurring ‘version nightmares’.

1. Complex Concept Sets are Unavoidable

Let’s start with an obvious point: we cannot avoid using complex concept set expressions. Relying solely on a conceptId plus its descendants is rarely sufficient (Sorry, @Christian_Reich ). Because of the polyhierarchy in concept_ancestor, simply pulling in all descendants often introduces specificity errors. Let’s call this established and not debate it.

2. The CIRCE Concept Set Expression does not support concept_relationship and thus ignores laterality

If we agree that complex expressions are necessary, the next question is: Do our tools support building them effectively, or are we forced into complex acrobatics?

Currently, the concept set expression model in CIRCE supports only two modes of traversal:

Hierarchy (Vertical): Using includeDescendants, which queries the pre-computed concept_ancestor table. This is fast and powerful for parent-child relationships.
Mapping (Limited Horizontal): Using includeMapped. This is an occasionally used acrobatic maneuver strictly limited to the hardcoded 'Maps to' relationship in concept_relationship (as discussed in the forum link above from @Chris_Knoll ).

This limitation means that the rich, ‘lateral’ ontological relationships stored in concept_relationship—such as ‘Has causative agent’, ‘Has finding site’, or ‘Has active ingredient’—are essentially locked away from us within the ATLAS/CIRCE framework. We cannot use them to define our concept sets.
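To make the two supported traversal modes concrete, here is a minimal sketch of how a concept set expression could be resolved against the vocabulary tables. It is not the CIRCE implementation: the item keys (CONCEPT_ID, includeDescendants, includeMapped, isExcluded) follow the ATLAS JSON export as I understand it, the resolution logic is simplified, and the concept IDs and table contents are made-up toy data.

```python
import sqlite3

# Toy vocabulary (made-up IDs): concept 1 is the parent of 2 and 3,
# and source concept 9 'Maps to' standard concept 2.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept_ancestor (ancestor_concept_id INT, descendant_concept_id INT);
CREATE TABLE concept_relationship (concept_id_1 INT, concept_id_2 INT, relationship_id TEXT);
INSERT INTO concept_ancestor VALUES (1,1),(2,2),(3,3),(1,2),(1,3);
INSERT INTO concept_relationship VALUES (9,2,'Maps to'),(2,9,'Mapped from');
""")

def expand(conn, items):
    """Expand items vertically (descendants) and, if asked, via 'Maps to'."""
    ids = set()
    for item in items:
        concept_id = item["concept"]["CONCEPT_ID"]
        found = {concept_id}
        if item.get("includeDescendants"):
            # Vertical traversal: one lookup in the pre-computed closure.
            found |= {row[0] for row in conn.execute(
                "SELECT descendant_concept_id FROM concept_ancestor "
                "WHERE ancestor_concept_id = ?", (concept_id,))}
        if item.get("includeMapped"):
            # Limited horizontal traversal: only the 'Maps to' relationship.
            for target in list(found):
                found |= {row[0] for row in conn.execute(
                    "SELECT concept_id_1 FROM concept_relationship "
                    "WHERE concept_id_2 = ? AND relationship_id = 'Maps to'",
                    (target,))}
        ids |= found
    return ids

def resolve(conn, expression):
    """Included concepts minus excluded concepts."""
    items = expression["items"]
    included = expand(conn, [i for i in items if not i.get("isExcluded")])
    excluded = expand(conn, [i for i in items if i.get("isExcluded")])
    return included - excluded

expression = {"items": [
    {"concept": {"CONCEPT_ID": 1}, "includeDescendants": True, "includeMapped": True},
    {"concept": {"CONCEPT_ID": 3}, "isExcluded": True},
]}
resolved = resolve(conn, expression)
print(sorted(resolved))  # [1, 2, 9]
```

Note that nothing in this structure offers a place to say "follow ‘Has causative agent’": the only relationship_id the sketch (like the real expression model) ever touches is ‘Maps to’.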

3. Example: The Challenge of DILI (Drug-Induced Liver Injury)

Complex phenotypes like DILI perfectly illustrate this difficulty. The clinical idea is straightforward:

Find conditions representing liver injury that are NOT caused by other specific etiologies like viruses, alcohol, or physical obstruction.

Currently, we solve this using cohort definition acrobatics—perhaps 10 inclusion rules requiring exactly 0 occurrences of hepatic viruses within -7 to +7 days of the hepatitis event, etc. We do this because the entry event concept set cannot leverage laterality; it can only navigate parent-child relationships.

What we should be able to ask the vocabulary is: “Show me all descendants of ‘Liver Injury’ that DO NOT have a ‘Caused by’ relationship to a descendant of ‘Virus’ or ‘Alcohol’.”

Because CIRCE concept set expression cannot traverse the ‘Caused by’ relationship, we are forced into a painful, manual workaround:

Manual Curation: We must painstakingly search the vocabulary for every pre-coordinated concept that implies a non-drug cause (e.g., “Alcoholic hepatitis,” “Acute viral hepatitis B,” “Obstructive biliary disease”).
Brittle Exclusion List: We then add this long, manually curated list to the “Excluded Concepts” panel. It is brittle because the concept set expression does not automatically reflect any updates to our clinical idea of what may be causing the disease.

This workaround is not just tedious; it’s the primary source of the “difficulty” and “version nightmares” I mentioned. It is error-prone (leading to low recall or specificity errors), and the list becomes obsolete with the next vocabulary update, requiring the entire manual process to be repeated. And do not get me started on the added complexity of non-standard ICD10CM/ICD9CM code mapping.

4. Why This Matters for the Future: the community is already trying to solve the ‘phenotyping is difficult’ problem

I am hearing about many innovations being built on top of the existing concept_ancestor structure and the current CIRCE model—for example, using generative AI or vector embedding-based semantic search to build concept set expressions.

Perhaps these approaches will succeed, and we can continue to rely solely on concept_ancestor without using the concept_relationship. But perhaps we are overlooking a solution that already exists within our ontology (i.e., laterality) and hamstringing these new tools by building them on an incomplete foundation.

This is even more critical in domains like cancer phenotyping, where laterality is essential for distinguishing key characteristics:

Origin/Status: Primary (originating in the lung) vs. Secondary (metastasis from elsewhere).
Morphology: Malignant (cancerous) vs. Benign.
Histology: Non-Small Cell Lung Cancer (NSCLC) vs. Small Cell Lung Cancer (SCLC).
Location: Lower respiratory tract (lung/bronchus) vs. Upper respiratory tract.

To achieve robust, reproducible, and scalable phenotyping, we need tools that can navigate the full richness of our vocabulary, not just the hierarchy.

There are many ways to tackle this problem - if we agree this is a problem. Maybe we should dream big: rethink this completely, adopt labeled property knowledge graphs, and throw away concept_ancestor and concept_relationship (making us AI/LLM ready). Maybe we should not blindly treat every concept that lacks an equivalent as non-standard (e.g., ICD10CM codes that do not have a 1:1 equivalence mapping) and map it in a lossy way, but instead make such concepts SNOMED extensions and make them standard with proper descendants. Maybe we should make a breaking change to CIRCE to support knowledge graphs instead of SQL.

Gowtham_Rao · August 31, 2025, 1:10pm

A little more about the two OMOP vocabulary tables: concept_relationship and concept_ancestor. concept_relationship provides the foundational, direct links between concepts. concept_ancestor is a derived optimization table specifically designed to handle complex hierarchies.

Concept Relationship (https://ohdsi.github.io/CommonDataModel/cdm54.html#concept_relationship)
The concept_relationship table stores the direct, asserted relationships (fundamental truths, axioms) between two concepts, as defined by the source vocabularies (e.g., SNOMED CT, RxNorm, LOINC) or curated by the OHDSI community vocabulary team. It defines how two concepts are related. It bridges different vocabulary domains (e.g., linking a condition to a drug that treats it) and different source terminologies (‘Maps to’, e.g., mapping ICD-10 to SNOMED). It only stores immediate relationships. If A is related to B, and B is related to C, this table stores (A-B) and (B-C), but not (A-C). It captures many relationships e.g., ‘Has active ingredient’, ‘Has finding site’, ‘Has causative agent’ via relationship_id with hierarchical relationships represented using ‘Is a’.

Concept Ancestor (https://ohdsi.github.io/CommonDataModel/cdm54.html#concept_ancestor)
The concept_ancestor table is derived from the hierarchical relationships (‘Is a’) present in concept_relationship. It represents the transitive closure of the vocabulary hierarchies. Its purpose is to provide a pre-calculated lookup of all hierarchical lineages. It stores beyond immediate relationships, i.e., If A is a B, and B is a C, concept_ancestor explicitly stores (A-B), (B-C), and (A-C)—but only in a hierarchical context (i.e., derived from ‘Is a’). It does not include non-hierarchical relationships like ‘Maps to’ or ‘Has ingredient’. Since it is a derived table from hierarchical relationships, it is limited to ‘Is a’ relationships between valid and standard concepts. By precalculating, the concept_ancestor transforms complex hierarchical traversals into simple, efficient JOIN operations. Without concept_ancestor, concept_relationship table based queries will require a complex and slow recursive CTE.
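A small sketch of what "transitive closure" means here, using SQLite through Python so it runs anywhere. The recursive CTE derives the (A-B), (B-C), and (A-C) pairs from direct ‘Is a’ rows. The concept IDs are made up, the hierarchy is assumed to be acyclic, and the real concept_ancestor build also stores self-referencing rows and a maximum level of separation, which are omitted here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept_relationship (concept_id_1 INT, concept_id_2 INT, relationship_id TEXT);
-- Toy rows: 1 'Is a' 2, 2 'Is a' 3, plus one lateral row that must be ignored.
INSERT INTO concept_relationship VALUES
  (1, 2, 'Is a'), (2, 3, 'Is a'), (1, 99, 'Has causative agent');
""")

closure = conn.execute("""
WITH RECURSIVE lineage(ancestor_concept_id, descendant_concept_id, levels) AS (
    -- Direct parent-child pairs: child 'Is a' parent.
    SELECT concept_id_2, concept_id_1, 1
    FROM concept_relationship
    WHERE relationship_id = 'Is a'
    UNION
    -- Climb one more 'Is a' step from the current ancestor.
    SELECT cr.concept_id_2, l.descendant_concept_id, l.levels + 1
    FROM lineage l
    JOIN concept_relationship cr
      ON cr.concept_id_1 = l.ancestor_concept_id
     AND cr.relationship_id = 'Is a'
)
SELECT ancestor_concept_id, descendant_concept_id, MIN(levels) AS min_levels_of_separation
FROM lineage
GROUP BY ancestor_concept_id, descendant_concept_id
ORDER BY ancestor_concept_id, descendant_concept_id
""").fetchall()

for row in closure:
    print(row)  # (2, 1, 1), (3, 1, 2), (3, 2, 1)
```

With the closure stored as a table, "all descendants of 3" becomes a single WHERE ancestor_concept_id = 3 filter instead of a recursive query at run time.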

In summary, concept_relationship is difficult to query using SQL for deep relationships because of the explosion of CTEs and recursiveness. Concept_ancestor is an elegant workaround to achieve the goal of transforming complex hierarchical traversals into simple, efficient JOIN operations. However, concept_ancestor is limited to ‘Is a’/‘subsumes’ relationship_id, standard concepts (and classification?), and valid concepts.

In reality, concept_ancestor is a pragmatic choice designed to achieve the OHDSI mission of network studies on multiple DBMS with computational efficiency. It achieves this by using pre-computed content where possible. OHDSI tools (Circe) do not support the dynamic generation of complex SQL to traverse the concept_relationship table. Further, SQL-based queries that are overtly complex are likely to have performance issues and potentially fail over the multiple DBMS that OHDSI network studies would like to support.

But because it only supports a limited set of relationship_id (‘Is a’/‘Subsumes’) and because OHDSI tools do not support the richness of the concept_relationship, we can’t leverage these in the Cohort Definitions. Obviously, ‘Has causative agent’ is a special relationship_id that does not represent a parent-child/ancestor relationship.

The above is just one explanation as to why phenotyping is difficult. This difficulty is further explained by:

Standard/non-standard: We attempt to pick standard vs non-standard based on quality judgments, such as whether a code follows principles like non-redundancy, atomicity, compositionality, clarity, and unambiguity, but vocabulary creators have to make pragmatic decisions to balance expressive power and practical usability. Do we know how good our classification of standard/non-standard is? If we get a list of ICD codes as the source of truth, we wave it off as being inferior, probably because the effort to create a ‘standard’ based concept set expression is a pain, and it is so much easier to start a concept set expression from scratch.
Mapping choices: if we declare a code to be standard and a code to be non-standard, then we have to map them. If a non-standard code has a meaning that is not equivalent to a standard code, then we map in a lossy way to a broader standard concept. This, while intuitive, causes headaches in concept set expression building because selecting a descendant will increase the probability of pulling codes in unexpected ways.

Gowtham_Rao · August 31, 2025, 3:00pm

I think this problem may be succinctly described as the SQL Expressivity Gap. This gap is fundamentally a mismatch between the query language (SQL) and the underlying data structure (an ontology, which is inherently a graph). Graph databases (e.g., Labeled Property Graphs or LPGs) and their associated query languages (e.g., Cypher) are explicitly designed to handle the interconnected nature of ontological data.

In theory, they offer significant advantages over SQL in this domain:

Expressivity and Intuition: Graph languages can express complex biomedical criteria much more intuitively by directly mirroring the ontological structure.
Conciseness: A query traversing multiple relationship types that might require dozens of JOINs in SQL can often be represented in a few lines of Cypher.
Maintainability and Transparency: Simpler expressions enhance readability and reduce the overhead associated with maintaining and validating concept sets, particularly across vocabulary updates.
LLM Compatibility: The pattern-matching syntax of graph languages aligns well with the capabilities of Large Language Models (LLMs), suggesting significant potential for accelerated, AI-assisted phenotype development.

Illustrative Use Case: DILI

Continuing with the example of Drug-Induced Liver Injury (DILI):

“Create a concept set representing liver injury Concept IDs that are NOT caused by other specific etiologies like viruses, alcohol, or physical obstruction.”

In the current SQL/Circe paradigm, we resort to “vocabulary acrobatics”: manually curating extensive, brittle lists of pre-coordinated concepts (e.g., “Alcoholic hepatitis”) for exclusion. This list requires constant maintenance as the vocabulary evolves.

In a graph paradigm, this requirement can be expressed directly against the ontology: “Select all descendants of ‘Liver Injury’ that DO NOT possess a ‘Has causative agent’ relationship to any descendant of ‘Virus’ or ‘Alcohol’.”

A simplified conceptual Cypher query:

// Find all concepts descending from 'Liver Injury'
MATCH (injury:Concept)-[:IS_A*]->(:Concept {name: 'Liver Injury'})
// Ensure that the injury concept does not have a causative agent
// that descends from 'Virus' or 'Alcohol'
WHERE NOT EXISTS {
    MATCH (injury)-[:HAS_CAUSATIVE_AGENT]->(cause)
    WHERE (cause)-[:IS_A*]->(:Concept {name: 'Virus'})
       OR (cause)-[:IS_A*]->(:Concept {name: 'Alcohol'})
}
RETURN injury.concept_id

This approach is declarative, robust to vocabulary updates, and directly reflects the clinical intent. While concept_ancestor facilitates efficient hierarchical queries using SQL by pre-calculating the transitive closure of the ‘Is a’ relationships, it cannot support the lateral, axiomatic filtering required here.

Specifically, the concept_ancestor table cannot execute the following logic:

// === AXIOMATIC (Lateral) Filtering - Exclusion Criteria ===
// Filter the candidate set by excluding specific etiologies. This is where
// we overcome the SQL Expressivity Gap by utilizing asserted axioms.
WHERE NOT EXISTS {
    // Step 2a: LATERAL Traversal
    // We traverse from the condition (injury) to its asserted cause using the
    // 'HAS_CAUSATIVE_AGENT' relationship. This is the critical lateral link
    // (stored in concept_relationship) that is generally inaccessible in standard
    // ATLAS/Circe concept set definitions.
    MATCH (injury)-[:HAS_CAUSATIVE_AGENT]->(cause)

    // Step 2b: HIERARCHICAL Traversal on the Cause
    // Verify if that identified 'cause' is a descendant of 'Virus' OR 'Alcohol'.
    WHERE (cause)-[:IS_A*]->(:Concept {name: 'Virus'})
       OR (cause)-[:IS_A*]->(:Concept {name: 'Alcohol'})
}
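For comparison, here is a sketch of the same filter written by hand in SQL against both vocabulary tables, run through SQLite. It is not something ATLAS/Circe can generate today. It combines concept_ancestor (for the two hierarchical steps) with a single hop over concept_relationship (for the lateral step). All concept IDs and rows are made-up toy data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE concept_ancestor (ancestor_concept_id INT, descendant_concept_id INT);
CREATE TABLE concept_relationship (concept_id_1 INT, concept_id_2 INT, relationship_id TEXT);
-- 100 = liver injury; 101 = viral hepatitis; 102 = alcoholic hepatitis; 103 = toxic liver injury
-- 200 = virus (201 = a specific virus); 300 = alcohol (301 = ethanol); 401 = some drug
INSERT INTO concept_ancestor VALUES
  (100,100),(100,101),(100,102),(100,103),
  (200,200),(200,201),(300,300),(300,301);
INSERT INTO concept_relationship VALUES
  (101,201,'Has causative agent'),
  (102,301,'Has causative agent'),
  (103,401,'Has causative agent');
""")

query = """
SELECT ca.descendant_concept_id
FROM concept_ancestor ca
WHERE ca.ancestor_concept_id = :injury            -- hierarchy: descendants of the injury
  AND NOT EXISTS (
      SELECT 1
      FROM concept_relationship cr                -- lateral: injury -> asserted cause
      JOIN concept_ancestor cause                 -- hierarchy: is the cause an excluded etiology?
        ON cause.descendant_concept_id = cr.concept_id_2
      WHERE cr.concept_id_1 = ca.descendant_concept_id
        AND cr.relationship_id = 'Has causative agent'
        AND cause.ancestor_concept_id IN (:virus, :alcohol)
  )
ORDER BY ca.descendant_concept_id
"""
kept = [row[0] for row in conn.execute(
    query, {"injury": 100, "virus": 200, "alcohol": 300})]
print(kept)  # [100, 103]
```

Two things are worth noticing. First, one lateral hop is still manageable in SQL; it is chains of several lateral hops, or lateral relationships that are not covered by concept_ancestor, that push us toward recursive CTEs. Second, in either language, a concept with no asserted ‘Has causative agent’ row is kept, so the result is only as good as the attribute relationships in the vocabulary.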

Integrating Graph Queries within the CIRCE Model

It is possible to replace SQL-based concept set resolution with a graph-based approach within the existing OHDSI framework. The Concept Set Expression is a modular component within the broader Circe cohort definition JSON structure. Currently, this expression is rendered into SQL, which resolves to a temporary table containing the relevant Concept IDs.

It is architecturally feasible to replace this SQL rendering engine with a graph query execution engine. The graph query would fulfill the same contract, returning a set of Concept IDs, which can then be utilized by the subsequent stages of the cohort definition logic. While this implementation is not trivial, it may be a consideration for the roadmap of Atlas 3.x @anthonysena @Chris_Knoll .

Is Graph Empirically Superior?

How do we empirically test that a graph language is superior to the current SQL-based approach? We propose the following hypotheses:

H1 (Expressivity): Graph languages (e.g., Cypher) can represent complex concept set expressions with greater conciseness and fidelity to clinical intent compared to current Circe/SQL implementations.
H2 (LLM Efficacy): LLMs can reliably generate and validate graph-based concept set expressions, significantly improving productivity in phenotype development.
H3 (UI/UX Feasibility): It is feasible to develop a user interface within tools like Atlas that allows researchers to intuitively construct and visualize graph-based queries, replacing the current concept set editor.
H4 (Infrastructure): A graph database representation of the OMOP vocabulary can effectively supersede the functional roles of concept_relationship and the derived concept_ancestor table while meeting performance requirements.

Implementation Pathways

To test these hypotheses, we need to construct a knowledge graph database. Two primary pathways are available:

Reuse and transform OMOP-Vocabulary into Graph DB (Pragmatic Short-Term): Convert the existing concept_relationship table into a graph representation (LPG or RDF). This is the most straightforward approach for initial hypothesis testing, as the knowledge graph is easier to construct from existing tables.
UMLS-Native Approach (Strategic Long-Term): If we establish the superiority of knowledge graphs, we can pursue a more robust solution. This involves loading the vocabulary directly from the UMLS Metathesaurus into a graph database, adding the OMOP concept_id as a property to the corresponding UMLS Concept Unique Identifier (CUI). This approach ensures native alignment with the foundational source ontology, enhancing defensibility, traceability, and interoperability. This strategy could eventually render concept_ancestor and significant parts of concept_relationship obsolete.

I think the benefit of the OHDSI community adopting a graph-based approach extends far beyond phenotyping. It establishes a foundation for advanced AI applications, including Graph Retrieval-Augmented Generation (Graph RAG) and sophisticated semantic search. We can store the concept_id and even embeddings as properties in the graph, and make them a community asset in Neo4j. Furthermore, this infrastructure improves the reliability of AI outputs by facilitating standardized Named Entity Recognition (NER) and Normalization (NEN) anchored to OMOP/UMLS concepts within prompts.

This evolution is likely a necessary step toward realizing the full potential of observational research and ensuring the OHDSI ecosystem is prepared for the next generation of AI-driven evidence generation.

Gowtham_Rao · August 31, 2025, 3:47pm

Hopefully, I will get access to a Neo4j database containing an LPG representation of the concept_relationship table. I do not have experience with Cypher queries, so although theoretically this sounds interesting, I do not know if it will work yet. Here is an update on my assessment of the four key hypotheses:

H1 (Expressivity)

The superior expressivity of graph databases for relationship-heavy queries can largely be considered established knowledge. However, it is still worthwhile to validate this specifically within the context of the OHDSI vocabulary. Once the Neo4j instance is available, and I have had time to familiarize myself with the Cypher language, I will conduct specific tests and report back.

H2 (LLM Efficacy)

Hypothesis: LLMs can effectively translate natural language intent into executable graph queries, bridging the gap for non-technical users.

I assume that with the increasing maturity of LLMs and advancements in “Text-to-Query” fine-tuning, this hypothesis should be true. LLMs have demonstrated a sophisticated ability to understand both the semantic intent of a query and how to apply that intent to complex database schemas to render accurate queries.

To test this, I prompted Gemini with a couple of complex scenarios, assuming a standard LPG schema (Nodes labeled :Concept, relationships typed by relationship_id). The results were very convincing albeit untested:

Scenario 1: Cross-Domain Relationships

Natural Language Intent:

“Find all conditions that are known indications for drugs containing the ingredient ‘Metformin’ (Concept ID 1503297).”

LLM-Generated Cypher:

MATCH (ingredient:Concept {concept_id: 1503297})
// Find drugs that have this ingredient. (Drug)-[:Has_ingredient]->(Ingredient)
MATCH (drug:Concept)-[:Has_ingredient]->(ingredient)
// Find conditions that the drug may treat. (Drug)-[:May_treat]->(Condition)
MATCH (drug)-[:May_treat]->(condition:Concept)
WHERE condition.domain_id = 'Condition'
RETURN DISTINCT condition.concept_id, condition.concept_name

Scenario 2: Complex “Graph Concept Set” (Hierarchy + Attributes)

Natural Language Intent:

“I want to create a concept set for ‘Breast Cancer’. This should include all descendants of ‘Malignant neoplasm of breast’ (Concept ID 4112853). However, I only want to include concepts where the associated morphology is specifically ‘Infiltrating duct carcinoma’ (Concept ID 4237178).”

LLM-Generated Cypher:

// 1. Identify the starting hierarchical concept and the target attribute concept
MATCH (start_concept:Concept {concept_id: 4112853})
MATCH (target_morphology:Concept {concept_id: 4237178})

// 2. Find all descendants of the starting concept (traversing backwards)
// Note: Assumes (Descendant)-[:Subsumes]->(Ancestor) based on OHDSI standard
MATCH (start_concept)<-[:Subsumes*0..]-(descendant:Concept)

// 3. FILTER: Keep only descendants that also have the specified morphology relationship
// (Condition)-[:Has_morphology]->(Morphology)
WHERE (descendant)-[:Has_morphology]->(target_morphology)

RETURN DISTINCT descendant.concept_id, descendant.concept_name

These examples strongly suggest that LLMs can effectively handle the complex, multi-hop logic required for advanced phenotyping without requiring the researcher to write the graph query manually. Of course, I gave it the concept_id values in the prompt and have not tested or challenged it to pick them itself. Picking the concept_id is something that others are trying to solve now using semantic search, and this should be doable with NER/NEN.

H3 (UI/UX Feasibility)

Hypothesis: It is feasible to develop a user interface that allows researchers to intuitively construct and visualize graph-based queries.

My preliminary research suggests this is feasible. There are several dedicated Visual Query Builders, Visual Graph Exploration Tools, and Foundational Libraries (like Cytoscape.js) for custom development. Foundational libraries could be integrated into OHDSI tools (e.g., Atlas) to build a UI/UX for complex concept set expressions. I am going to assume that these Visual Query Builders are capable of supporting the construction of sufficiently complex Cypher queries.

H4 (Infrastructure)

This will be the last item to prove and likely the most challenging aspect of adoption. Obviously, replacing concept_relationship and concept_ancestor with a graph database backend would be a significant breaking change to the OHDSI tool stack and infrastructure. This requires careful consideration.

Gowtham_Rao · August 31, 2025, 5:08pm

Regarding the conceptual implementation pathways for creating a graph database.

1. Converting Current OMOP-Vocabulary tables into Graph DB:

Migrating the existing OMOP vocabulary tables into a Labeled Property Graph (LPG) involves restructuring the relational data. This transformation turns the implicit relationships defined by relational joins into explicit, traversable edges within the graph. The migration sources are the core tables: concept, concept_relationship, relationship, concept_synonym, vocabulary, and domain.

1.1. Defining the Node Model (Entities):
The central entity is the Concept. Properties such as concept_id, concept_name, concept_code, standard_concept, concept_class_id, and validity dates are stored directly on the node. We may also want to add secondary labels based on the domain_id (e.g., :Drug, :Condition, :Procedure). We may need metadata nodes sourced from the domain and vocabulary tables to provide context and provenance. Data from the concept_synonym table must be linked to the appropriate concepts, optimally stored as a list or array property (e.g., ‘synonyms’) on the corresponding :Concept node.

1.2. Defining the Edge Model (Connections):
The core of the model consists of Semantic Edges (Concept-to-Concept). These are directed edges created from the node identified by concept_id_1 to the node identified by concept_id_2. The edge type is defined by the relationship_id (e.g., ‘Is a’, ‘Maps to’), which specifies the semantic nature of the connection (e.g., (c1)-[:Is_a]->(c2)). Edge Properties must include the validity information (valid_start_date, valid_end_date, invalid_reason), as this data pertains to the relationship itself. We can also add contextual edges to link the :Concept nodes to their metadata, replacing the domain_id and vocabulary_id foreign keys. The relationship table provides the definitions for these edge types. Attributes like is_hierarchical, defines_ancestry, and information on reverse relationships inform how algorithms should traverse the graph (e.g., identifying which relationships to follow when determining lineage).
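As a rough sketch of sections 1.1 and 1.2, the functions below turn one concept row and one concept_relationship row into parameterized Cypher statements. Nothing here talks to a database: the statements would be passed to whatever Neo4j client is used. The function names, the upper-case edge naming (e.g., HAS_CAUSATIVE_AGENT), and the sample rows are my own assumptions for illustration, not an agreed OHDSI convention.

```python
import re

def edge_type(relationship_id):
    """Turn an OMOP relationship_id into a Cypher-safe relationship type."""
    return re.sub(r"[^A-Za-z0-9]+", "_", relationship_id).strip("_").upper()

def node_statement(concept):
    """Cypher + parameters that create or update one :Concept node."""
    props = {k: v for k, v in concept.items() if k != "concept_id"}
    query = "MERGE (c:Concept {concept_id: $concept_id}) SET c += $props"
    # Secondary label from domain_id (e.g., :Condition). Labels cannot be
    # passed as ordinary parameters, so the value is sanitized and inlined.
    label = re.sub(r"[^A-Za-z0-9]", "", concept.get("domain_id", ""))
    if label:
        query += ", c:" + label
    return query, {"concept_id": concept["concept_id"], "props": props}

def edge_statement(rel):
    """Cypher + parameters for one directed edge concept_id_1 -> concept_id_2."""
    query = (
        "MATCH (a:Concept {concept_id: $concept_id_1}), "
        "(b:Concept {concept_id: $concept_id_2}) "
        "MERGE (a)-[r:" + edge_type(rel["relationship_id"]) + "]->(b) "
        "SET r.valid_start_date = $valid_start_date, "
        "r.valid_end_date = $valid_end_date, "
        "r.invalid_reason = $invalid_reason"
    )
    keys = ("concept_id_1", "concept_id_2", "valid_start_date",
            "valid_end_date", "invalid_reason")
    return query, {k: rel[k] for k in keys}

# Made-up sample rows, for illustration only.
concept = {"concept_id": 1, "concept_name": "Example condition",
           "domain_id": "Condition", "standard_concept": "S"}
rel = {"concept_id_1": 1, "concept_id_2": 2,
       "relationship_id": "Has causative agent",
       "valid_start_date": "1970-01-01", "valid_end_date": "2099-12-31",
       "invalid_reason": None}

node_query, node_params = node_statement(concept)
edge_query, edge_params = edge_statement(rel)
print(node_query)
print(edge_query)
```

Because concept_relationship already stores both directions of every relationship (‘Is a’ and ‘Subsumes’, ‘Has causative agent’ and ‘Causative agent of’), a loader like this would either create both edges or keep one direction per pair and rely on the graph engine to traverse it backwards. That is a modeling decision we would need to make up front.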

2. Convert UMLS-Native files into Graph DB:

Conceptually, constructing an LPG directly from the UMLS Metathesaurus, and subsequently enriching it with OMOP concept_ids, should be possible.

The UMLS Metathesaurus integrates over 200 biomedical vocabularies (e.g., SNOMED CT, RxNorm, LOINC). It is distributed primarily in the Rich Release Format (RRF) Metathesaurus Data Files, a set of pipe-delimited relational files.

The National Library of Medicine (NLM) provides loaders/parsers for RRF. The official tool, MetamorphoSys, allows for customizing, subsetting, and parsing the RRF files. It also includes scripts to load the UMLS data into standard relational databases (RDBMS), such as MySQL or PostgreSQL. Next, we could use ETL tools (e.g., Neo4j’s) to migrate the data from the RDBMS tables into the target graph structure. This approach leverages the NLM’s official parser and manages the complexity of the RRF format.

We must decide on an appropriate Graph Model. This requires consideration of the different identifier systems: OMOP mappings rely on specific source codes and vocabulary names, while UMLS uses identifiers such as CUI (Concept Unique Identifier), LUI (Lexical Unique Identifier), and AUI (Atom Unique Identifier).

The nodes (entities) may represent CUIs or AUIs, with properties such as the term string, term type (TTY), name, and source vocabulary (SAB). The edges (connections) should link concepts to atoms (e.g., CUI to AUI) and concepts to other concepts (CUI to CUI). These edges will also need appropriate properties—such as the relationship type (REL/RELA) and the source asserting the relationship—to represent their provenance.

Finally, we have to map and attach the corresponding OMOP concept_id as a property to the relevant nodes. This approach effectively replaces the need for the OMOP concept_ancestor, concept_relationship, and concept_synonym tables, as these relationships are inherently captured by the graph structure, while still retaining the utility of the OMOP concept_id.
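To illustrate the last two steps, here is a minimal sketch that parses one MRCONSO.RRF row into an atom (AUI) node and attaches an OMOP concept_id by matching source vocabulary and code. The column order is the documented MRCONSO layout. The SAB-to-vocabulary_id mapping is deliberately incomplete, and the sample row, the lookup dictionary, and the concept_id value are invented for illustration.

```python
# Column order of MRCONSO.RRF as documented in the UMLS reference manual.
MRCONSO_COLUMNS = [
    "CUI", "LAT", "TS", "LUI", "STT", "SUI", "ISPREF", "AUI", "SAUI",
    "SCUI", "SDUI", "SAB", "TTY", "CODE", "STR", "SRL", "SUPPRESS", "CVF",
]

# Illustrative, incomplete mapping from UMLS source abbreviations (SAB)
# to OMOP vocabulary_id values.
SAB_TO_OMOP_VOCABULARY = {
    "SNOMEDCT_US": "SNOMED",
    "RXNORM": "RxNorm",
    "LNC": "LOINC",
}

def parse_mrconso_line(line):
    """Split one pipe-delimited RRF row into a dict keyed by column name."""
    fields = line.rstrip("\n").split("|")
    # RRF rows end with a trailing pipe, so split() yields one extra empty
    # field; zip() drops it because MRCONSO_COLUMNS is one element shorter.
    return dict(zip(MRCONSO_COLUMNS, fields))

def atom_node(row, omop_lookup):
    """Build the property map for an atom node, with the OMOP concept_id attached."""
    vocabulary_id = SAB_TO_OMOP_VOCABULARY.get(row["SAB"])
    return {
        "aui": row["AUI"],
        "cui": row["CUI"],
        "str": row["STR"],
        "tty": row["TTY"],
        "sab": row["SAB"],
        "code": row["CODE"],
        # None when the atom has no counterpart in the OMOP concept table.
        "omop_concept_id": omop_lookup.get((vocabulary_id, row["CODE"])),
    }

# Invented row and lookup: neither the identifiers nor the concept_id are real.
sample_line = ("C9999999|ENG|P|L9999999|PF|S9999999|Y|A9999999||123456||"
               "SNOMEDCT_US|PT|123456|Example liver disorder|9|N||\n")
omop_lookup = {("SNOMED", "123456"): 42}  # (vocabulary_id, concept_code) -> concept_id

node = atom_node(parse_mrconso_line(sample_line), omop_lookup)
print(node)
```

In practice the lookup would be built from the OMOP concept table (vocabulary_id, concept_code, concept_id), and atoms whose source vocabulary is not in OMOP would simply carry no concept_id.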

Implications:

Converting UMLS-Native files into a Graph DB is significantly more complex, but such a resource offers substantial advantages. It can serve applications beyond the OHDSI ecosystem while simultaneously supporting OHDSI requirements. For example, the integrated graph (containing UMLS CUIs linked to OMOP concept_ids) can be highly effective in AI applications involving Named Entity Recognition (NER) and Named Entity Normalization (NEN). Furthermore, this approach simplifies OMOP vocabulary maintenance, as we can theoretically deprecate the concept_relationship and concept_ancestor tables.

This transformation is a lot of work but our community has experts who are able to perform these transformations. Our OMOP vocabulary team @Anton_Tatur and others have created the concept_ancestor and concept_relationship tables. @Anton_Tatur and I are going to try this out.

Why are we interested in this problem space:
In addition to my hunch that knowledge graphs may help us improve concept set creation, and thus phenotyping, we are also working on several efforts to leverage public knowledge in scientific and technical documents for AI-enabled decision making, from drug development to drug launch and pharmacovigilance. More on that to follow. I will probably discuss that in the Generative AI workgroup soon.

katy-sadowski · September 18, 2025, 12:04am

Hi @Gowtham_Rao , this is a very interesting post and I’m still working my way through the ideas slowly. I have one basic question though - are the other types of lateral relationships you mention fully robust in the vocabulary? Today, can I use “Has causative agent” and be confident I’ve caught all the variants of a concept with a given causative agent? Is the vocabulary team curating these and making sure they’re comprehensive?

Gowtham_Rao · September 18, 2025, 12:47am

so running this query

select relationship_id, count(*) countx
FROM cdm.concept_relationship
group by relationship_id
order by countx desc;

gives me

relationship_id	Count
Maps to	7,299,489
Mapped from	7,299,489
Subsumes	3,441,303
Is a	3,441,303
Has marketed form	2,020,882
Marketed form of	2,020,882
RxNorm has dose form	1,797,987
RxNorm dose form of	1,797,987
Has brand name	1,478,520
Brand name of	1,478,520
Tradename of	1,060,811
Has tradename	1,060,811
RxNorm - SPL	764,666
SPL - RxNorm	764,666
Status of	640,588
Has status	640,588
Has Module	632,202
Module of	632,202
Concept replaced by	612,702
Concept replaces	612,702
Has supplier	581,089
Supplier of	581,089
Value to Schema	545,166
Schema to Value	545,166
Box of	510,680
Available as box	510,680
RxNorm inverse is a	489,035
RxNorm is a	489,035
Constitutes	441,723
Consists of	441,723
Quantified form of	367,789
Has quantified form	367,789
Method of	297,562
Has method	297,562
RxNorm has ing	280,736
RxNorm ing of	280,736
Has finding site	275,246
Finding site of	275,246
Asso morph of	222,733
Has asso morph	222,733
Has dir proc site	162,453
Dir proc site of	162,453
Answer of	154,118
Has Answer	154,118
Component of	122,694
Has component	122,694
Concept same_as to	115,579
Concept same_as from	115,579
Variable to Schema	114,535
Schema to Variable	114,535
Property of	109,489
Has property	109,489
RxNorm - Source eq	102,380
Source - RxNorm eq	102,380
Has scale type	101,220
Scale type of	101,220
Time aspect of	100,407
Has time aspect	100,407
Has system	98,612
System of	98,612
Interprets of	93,427
Has interprets	93,427
Concept poss_eq from	85,785
Concept poss_eq to	85,785
RxNorm - ATC	77,898
ATC - RxNorm	77,898
Drug has drug class	73,842
Drug class of drug	73,842
May treat	67,475
May be treated by	67,475
Concept was_a from	64,076
Concept was_a to	64,076
Transcribes to	63,118
Is transcribed from	63,118
Schema to ICDO	60,203
ICDO to Schema	60,203
Histology of ICDO	59,914
Has Histology ICDO	59,914
Is translated from	59,159
Translates to	59,159
Has Topography ICDO	58,438
Topography of ICDO	58,438
ICDO to Proc Schema	57,279
Proc Schema to ICDO	57,279
Panel contains	54,630
Contained in panel	54,630
Access of	48,611
Has access	48,611
Has causative agent	45,114
Causative agent of	45,114
Has part of	44,456
Part of	44,456
Has product comp	41,469
Product comp of	41,469
Has active ing	41,302
Active ing of	41,302
CI to	40,884
CI by	40,884
Has dose form group	40,772
Dose form group of	40,772
RxNorm - ATC sec up	39,015
ATC - RxNorm sec up	39,015
Physiol effect by	38,640
Has physio effect	38,640
Has pathology	37,680
Pathology of	37,680
Dose form of	32,457
Has dose form	32,457
Laterality of	32,056
Has laterality	32,056
Has occurrence	31,779
Occurrence of	31,779
Using device	31,441
Device used by	31,441
Interpretation of	30,947
Has interpretation	30,947
Has due to	26,718
Due to of	26,718
Has proc site	26,677
Proc site of	26,677
Using acc device	26,585
Acc device used by	26,585
Has dir device	26,310
Dir device of	26,310
NDFRT ing of	25,755
NDFRT has ing	25,755
Using subst	25,529
Subst used by	25,529
Has indir proc site	25,278
Indir proc site of	25,278
Has MoA	24,422
MoA of	24,422
Intent of	24,373
Has intent	24,373
RxNorm - SNOMED eq	22,895
SNOMED - RxNorm eq	22,895
Drug-drug inter for	21,662
Has drug-drug inter	21,662
Dir morph of	21,140
Has dir morph	21,140
Disposition of	18,633
Has disposition	18,633
Maps to value	18,370
Value mapped from	18,370
Contained in	17,825
Contains	17,825
Has basis str subst	17,129
Basis str subst of	17,129
Prec ingredient of	16,677
Has prec ingredient	16,677
CPT4 - SNOMED eq	16,206
SNOMED - CPT4 eq	16,206
Dir subst of	15,910
Has dir subst	15,910
VAProd - RxNorm eq	15,312
RxNorm - VAProd eq	15,312
Has precise ing	15,255
Precise ing of	15,255
CPT4 - SNOMED cat	15,079
SNOMED cat - CPT4	15,079
Has relat context	14,441
Relat context of	14,441
NDFRT dose form of	14,237
NDFRT has dose form	14,237
Temporal context of	14,186
Has temporal context	14,186
ICDO to Chapter	13,246
Chapter to ICDO	13,246
Has form	12,543
Form of	12,543
Value of	12,244
Has Value	12,244
Has modality	12,211
Modality of	12,211
Focus of	12,136
Has focus	12,136
Cytotox chemo RX of	11,102
Has cytotox chemo Rx	11,102
Cytotoxic chemo of	11,030
Has cytotoxic chemo	11,030
Accepted use of	10,889
Has accepted use	10,889
Numerator unit of	10,855

SELECT
    c1.concept_name AS concept_1,
    cr.relationship_id,
    c2.concept_name AS concept_2
FROM cdm.concept_relationship cr
INNER JOIN cdm.concept c1
    ON cr.concept_id_1 = c1.concept_id
INNER JOIN cdm.concept c2
    ON cr.concept_id_2 = c2.concept_id
WHERE cr.relationship_id = 'Has causative agent'
    AND lower(c2.concept_name) LIKE '%virus%'
ORDER BY concept_2
LIMIT 10;

schuemie · September 18, 2025, 5:28am

I had the same question as @katy-sadowski.

Has anybody evaluated the accuracy and completeness of these relationships?

m-khitrun · September 18, 2025, 7:52am

Hi @katy-sadowski and @schuemie

The table ‘relationship’ has a field ‘relationship_name’, where you can find the source of the relationships of interest:

The mentioned attribute relationships are sourced from SNOMED, and the vocabulary team doesn’t perform any curation of them. We assume that SNOMED ensures their completeness and accuracy.

Best,
Masha

Christian_Reich · September 18, 2025, 2:42pm

Thank you, thank you, thank you for starting with a use case. And this is a good one. It would be wonderful if the composite concepts had relationships to their attributes, like “caused by”, that we could exploit. And whatever you are proposing below will have to rely on quality attribute relationships like that. In the case of Conditions we are in OK shape, thanks to SNOMED (which is why I vehemently oppose anybody saying “let’s go to ICD”). Other domains, such as devices, visits, observations, and measurements, are deficient in or completely devoid of that.

Hahaha. You should say “thank you @Christian_Reich”, because the reality is that only the polyhierarchy is of reasonably high quality. The lateral relationships are nice to haves. Some vocabularies have them (SNOMED, RxNorm, not LOINC), most of them do not. We just import them, but we have no chance to curate or quality check them at all.

Not quite correct. All relationships that are used in building the polyhierarchy have a “1” in the field defines_ancestry of the RELATIONSHIP table. Whether they are lateral or hierarchical (and min_levels_of_separation is incremented) is defined by the flag is_hierarchical in the same table. But that doesn’t change what you are saying.
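A minimal sketch of how those two flags could be inspected, using Python's sqlite3 with a toy RELATIONSHIP table. The column names follow the description above; the relationship_id values and their flag settings are made up for illustration and are not real vocabulary rows.

```python
import sqlite3

# Toy stand-in for the RELATIONSHIP table. The relationship_id values and
# flag settings below are hypothetical, not real vocabulary content.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE relationship (
        relationship_id  TEXT PRIMARY KEY,
        is_hierarchical  TEXT,
        defines_ancestry TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO relationship VALUES (?, ?, ?)",
    [
        ("toy is-a", "1", "1"),
        ("toy lateral in ancestry", "0", "1"),
        ("toy attribute", "0", "0"),
    ],
)

# Classify each relationship by the two flags.
rows = conn.execute(
    """
    SELECT relationship_id,
           CASE
               WHEN defines_ancestry = '1' AND is_hierarchical = '1'
                   THEN 'builds ancestry, hierarchical'
               WHEN defines_ancestry = '1'
                   THEN 'builds ancestry, lateral'
               ELSE 'not used for ancestry'
           END AS role
    FROM relationship
    ORDER BY relationship_id
    """
).fetchall()

for relationship_id, role in rows:
    print(f"{relationship_id}: {role}")
```

Against a real vocabulary download, the same SELECT could be pointed at the actual RELATIONSHIP table to list which relationships feed the polyhierarchy.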

The UMLS is not curating these, either. They take what they get, just like us.

There is definitely a technology push necessary into OHDSI, which is probably a good thing. But: If we wanted to do any of the nice things you are saying we’d need a big push towards creating those ontological assets. With the LLMs this should be much easier, but it still would be a gigantic content creation job. I am not seeing it anywhere in your proposal.

Well, @m-khitrun answered the question. We are not checking for accuracy (which would be fairly easy to do, but is still work), and we are certainly not checking completeness (equally important, much harder to curate).

Bottom line: We need to talk about the ontology and the resources it will take to build it.

Gowtham_Rao · September 19, 2025, 12:39am

@Christian_Reich you know very well that the whole community is in your debt for what you have done for all of us.

If there is any community crazy enough to do it for free, it’s this community.

And now we all agree on why phenotyping is difficult.

Vojtech_Huser · September 19, 2025, 4:01pm

I fully agree that completeness is much harder than accuracy.
I like the idea of somehow naming the lateral relationships.

From perspective 1, there are two types:

Lateral-definitional: part of defining a term (e.g., LOINC components of codes, or a SNOMED condition if it is not primitive). Think of the relationships from component terms to composite (non-primitive) terms.
Lateral-non-definitional: originating outside the definition.

Note, however, that many lateral-definitional relationships do make their way into the hierarchy. At least in SNOMED, the computer-assisted placement of terms under the correct parent concepts in fact converts a large portion of lateral relationships into parent-child (is-a) relationships.

The “opposite”/companion/complement of a lateral relationship is the hierarchical (= parent-child, = is-a) relationship.

The SNOMED free-use term subset gives away the terms and titles but NOT the fancy definitions (the relationships to component terms). So some terminologies have IP (intellectual property) that comes into play. The agreement with SNOMED Int. that OHDSI may have - if we open the box of importing lateral-definitional relationships - oh boy…

Also note that those require groupings of relationships, and that is something the current vocab tables can’t handle. (See the link at the bottom; maybe later I will add more examples of grouping, and ideally of how they make or DO NOT make their way into the is-a hierarchy.)
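To make the grouping point concrete, here is a small sketch in Python. It assumes a hypothetical composite concept whose definition has two attribute groups; the concept name, the attribute values, and their pairing are invented for illustration. Flattening the groups into (concept, relationship, value) triples, which is roughly the shape of a concept_relationship row, loses which site belongs to which method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    relationship: str
    value: str


# Hypothetical definition A: two attribute groups, each pairing a method with a site.
grouped_a = [
    frozenset({Attribute("Has method", "Excision"), Attribute("Has proc site", "Kidney")}),
    frozenset({Attribute("Has method", "Biopsy"), Attribute("Has proc site", "Liver")}),
]

# Hypothetical definition B: same attributes, but paired the other way round.
grouped_b = [
    frozenset({Attribute("Has method", "Excision"), Attribute("Has proc site", "Liver")}),
    frozenset({Attribute("Has method", "Biopsy"), Attribute("Has proc site", "Kidney")}),
]


def flatten(concept_name, groups):
    """Drop the grouping and keep only flat (concept, relationship, value) triples."""
    return {(concept_name, a.relationship, a.value) for group in groups for a in group}


flat_a = flatten("Toy procedure", grouped_a)
flat_b = flatten("Toy procedure", grouped_b)

# The grouped definitions differ, but their flat triples are identical.
print(set(grouped_a) != set(grouped_b))
print(flat_a == flat_b)
```

Any table that keeps the grouping would need at least one extra piece of information per row, such as a group number; that column is an assumption of this sketch, not something in the current vocabulary tables.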

From perspective 2: creator of relationships

authored by SDO (e.g., LOINC gives us their hierarchy and groupings) (without IP restrictions)
authored by community (not given by SDO during terminology download)

The community can be OHDSI (e.g., RxNormExtension) but it could also be some non-OHDSI community (e.g., wikidata).

Also consider the mapping holder/owner (where we import it from) and the submission of issues to that holder entity.

We don’t want version disconnect. If a SNOMED term is wrong, don’t tell the Athena team; submit a ticket to the SNOMED holder.

If we want improvement in relationships, we can dream of the SDO doing it using taxpayer money (SNOMED model; annual contribution) or be realistic and accept that if we want progress, it may have to come from the community.
Long-term funding for infrastructure in medicine is unpopular with many funders.

If we don’t adopt any imperfect/community mapping into Athena, this does not advance humanity further. We may import them even if we know they can be inaccurate or incomplete and encourage folks to submit additions/corrections to the holder entity. (not to OHDSI).

In a way we are doing it today as well. We say: “use our Dx mappings but also triple check all mappings” - if your rigor is regulatory grade…

We have community contribution (of type add concept) for OHDSI but we may need community mini-contribution (of type downvote an incorrect relationship [and tell the holder !!])

This whole problem is solved if the concept set definition is done in R or Python or other code (with a smart way to use it later in the GUI). That way any external and smart relationship can be incorporated without acrobatics in Athena or the GUI, and without introducing exploded complexity in the Athena layer.
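A minimal sketch of what a concept set defined in code could look like, assuming Python with sqlite3 and toy copies of concept_ancestor and concept_relationship. All concept ids are made up, and the helper names descendants and with_attribute are hypothetical, not part of any OHDSI package.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE concept_ancestor (
        ancestor_concept_id   INTEGER,
        descendant_concept_id INTEGER
    );
    CREATE TABLE concept_relationship (
        concept_id_1    INTEGER,
        relationship_id TEXT,
        concept_id_2    INTEGER
    );
    """
)
# Toy data: 1 = "infection" with children 2 and 3; 10 = "virus" with child 11;
# 20 = some non-viral agent. All ids are invented for this sketch.
conn.executemany(
    "INSERT INTO concept_ancestor VALUES (?, ?)",
    [(1, 1), (1, 2), (1, 3), (2, 2), (3, 3), (10, 10), (10, 11), (11, 11)],
)
conn.executemany(
    "INSERT INTO concept_relationship VALUES (?, ?, ?)",
    [(2, "Has causative agent", 11), (3, "Has causative agent", 20)],
)


def descendants(conn, ancestor_id):
    """Vertical traversal: the ancestor and everything below it."""
    rows = conn.execute(
        "SELECT descendant_concept_id FROM concept_ancestor "
        "WHERE ancestor_concept_id = ?",
        (ancestor_id,),
    )
    return {row[0] for row in rows}


def with_attribute(conn, relationship_id, target_ids):
    """Lateral traversal: concepts pointing at any target via relationship_id."""
    if not target_ids:
        return set()
    placeholders = ",".join("?" for _ in target_ids)
    rows = conn.execute(
        "SELECT concept_id_1 FROM concept_relationship "
        f"WHERE relationship_id = ? AND concept_id_2 IN ({placeholders})",
        (relationship_id, *sorted(target_ids)),
    )
    return {row[0] for row in rows}


# "Infections caused by any virus": hierarchy intersected with a lateral attribute.
viral_infections = descendants(conn, 1) & with_attribute(
    conn, "Has causative agent", descendants(conn, 10)
)
print(sorted(viral_infections))
```

Because the expression is ordinary set algebra, it would be re-evaluated against whatever vocabulary version is loaded rather than stored as a fixed list of ids.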

We would first have to define the concept set as a dynamic component of the phenotype (currently it is a static snapshot copy). I am surprised we have survived this long resisting the demand for it.
That is the culprit to solve for the whole challenge of advanced-researcher phenotype definition.
Well, TAB is on it a bit with the schema definitions, and maybe Atlas 3 has that as a starting point too.

ADDITIONAL INFO
Grouping problem link (for hardcore terminologists only)
Expressions With Attribute Groups | SNOMED International Documents

here is the promised example of grouping
(putting angle brackets on link forces expansion to be off, I finally googled it)

https://browser.ihtsdotools.org/?perspective=full&conceptId1=1260293004&edition=MAIN/2025-09-01&release=&languages=en&latestRedirect=false

As SVG (SVGs are not supported; the above picture is not rendered well by the forum software, Discourse).

Also note that stated vs. inferred (in SNOMED) matters. Inferred is of course the one to use, but flipping between the two views lets you see the computational placement of defined terms under computer-only managed parents (because humans are less perfect at this).