Why is phenotyping difficult in OHDSI? Blame the concept set expression and vocabulary

Hopefully, I will get access to a Neo4j database containing a labeled property graph (LPG) representation of the concept_relationship table. I have no experience with Cypher queries, so although this sounds interesting in theory, I do not yet know whether it will work. Here is an update on my assessment of the four key hypotheses:


H1 (Expressivity)

The superior expressivity of graph databases is well established. However, it is still worth validating specifically within the context of the OHDSI vocabulary. Once the Neo4j instance is available and I have had time to familiarize myself with Cypher, I will run specific tests and report back.

H2 (LLM Efficacy)

Hypothesis: LLMs can effectively translate natural language intent into executable graph queries, bridging the gap for non-technical users.

I assume that, with the increasing maturity of LLMs and advances in “Text-to-Query” fine-tuning, this hypothesis should hold. LLMs have demonstrated an ability to understand the semantic intent of a request and to apply that intent to a complex database schema to produce accurate queries.

To test this, I prompted Gemini with two complex scenarios, assuming a standard LPG schema (nodes labeled :Concept, relationships typed by relationship_id). The results look convincing, although I have not yet run them against a real graph:

Scenario 1: Cross-Domain Relationships

  • Natural Language Intent:

    “Find all conditions that are known indications for drugs containing the ingredient ‘Metformin’ (Concept ID 1503297).”

  • LLM-Generated Cypher:

    MATCH (ingredient:Concept {concept_id: 1503297})
    // Find drugs that have this ingredient. (Drug)-[:Has_ingredient]->(Ingredient)
    MATCH (drug:Concept)-[:Has_ingredient]->(ingredient)
    // Find conditions that the drug may treat. (Drug)-[:May_treat]->(Condition)
    MATCH (drug)-[:May_treat]->(condition:Concept)
    WHERE condition.domain_id = 'Condition'
    RETURN DISTINCT condition.concept_id, condition.concept_name
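
To make the expected behavior of this query concrete without a Neo4j instance, here is a minimal Python sketch that runs the same two-hop traversal over toy rows shaped like concept_relationship. It is only an illustration under stated assumptions: apart from 1503297, every concept ID and name is made up, the relationship names are simply the ones used in the generated query (not checked against the vocabulary), and `rel_type` is a hypothetical helper. It also surfaces one practical detail: relationship_id values can contain spaces (for example 'Is a' or 'Maps to'), so a loader would have to map them to type names such as the underscore forms the query uses, or the query would have to quote them.

```python
from collections import defaultdict

# Toy rows shaped like concept_relationship: (concept_id_1, relationship_id, concept_id_2).
# Only 1503297 (Metformin) comes from the post; every other ID and name is made up.
CONCEPTS = {
    1503297: ("Metformin", "Drug"),
    9000001: ("Hypothetical metformin 500 MG tablet", "Drug"),
    9000002: ("Hypothetical metformin 850 MG tablet", "Drug"),
    9000003: ("Hypothetical diabetes condition", "Condition"),
    9000004: ("Hypothetical lab measurement", "Measurement"),
}
ROWS = [
    (9000001, "Has ingredient", 1503297),
    (9000002, "Has ingredient", 1503297),
    (9000001, "May treat", 9000003),
    (9000002, "May treat", 9000003),
    (9000002, "May treat", 9000004),  # dropped later by the domain filter
]


def rel_type(relationship_id: str) -> str:
    """Hypothetical mapping from a relationship_id to a Cypher-style type name."""
    return relationship_id.strip().replace(" ", "_")


def build_edges(rows):
    """Index edges both ways: out_edges[(source, type)] and in_edges[(target, type)]."""
    out_edges, in_edges = defaultdict(set), defaultdict(set)
    for source, relationship_id, target in rows:
        edge_type = rel_type(relationship_id)
        out_edges[(source, edge_type)].add(target)
        in_edges[(target, edge_type)].add(source)
    return out_edges, in_edges


def indications_for_ingredient(ingredient_id, rows, concepts):
    """Mirror the query: (drug)-[:Has_ingredient]->(ingredient), (drug)-[:May_treat]->(condition)."""
    out_edges, in_edges = build_edges(rows)
    drugs = in_edges[(ingredient_id, "Has_ingredient")]
    conditions = set()
    for drug in drugs:
        for target in out_edges[(drug, "May_treat")]:
            if concepts[target][1] == "Condition":  # WHERE condition.domain_id = 'Condition'
                conditions.add(target)
    return sorted(conditions)  # a set plus sorting plays the role of RETURN DISTINCT


print(indications_for_ingredient(1503297, ROWS, CONCEPTS))  # → [9000003]
```

On real data, the same comparison (a hand-checkable expected result versus what the generated Cypher returns) would be one simple way to test the query.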
    

Scenario 2: Complex “Graph Concept Set” (Hierarchy + Attributes)

  • Natural Language Intent:

    “I want to create a concept set for ‘Breast Cancer’. This should include all descendants of ‘Malignant neoplasm of breast’ (Concept ID 4112853). However, I only want to include concepts where the associated morphology is specifically ‘Infiltrating duct carcinoma’ (Concept ID 4237178).”

  • LLM-Generated Cypher:

    // 1. Identify the starting hierarchical concept and the target attribute concept
    MATCH (start_concept:Concept {concept_id: 4112853})
    MATCH (target_morphology:Concept {concept_id: 4237178})
    
    // 2. Find all descendants of the starting concept (traversing backwards)
    // Note: Assumes (Descendant)-[:Subsumes]->(Ancestor) based on OHDSI standard
    MATCH (start_concept)<-[:Subsumes*0..]-(descendant:Concept)
    
    // 3. FILTER: Keep only descendants that also have the specified morphology relationship
    // (Condition)-[:Has_morphology]->(Morphology)
    WHERE (descendant)-[:Has_morphology]->(target_morphology)
    
    RETURN DISTINCT descendant.concept_id, descendant.concept_name
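
The logic of this “graph concept set” (hierarchy traversal plus an attribute filter) can be sketched in plain Python on toy data. This is a sketch under assumptions, not a tested implementation: only 4112853 and 4237178 come from the prompt, all other IDs are made up, and the relationship names are copied from the generated query rather than checked against the vocabulary. One thing worth verifying on the real graph is edge direction. In the OMOP vocabulary, 'Subsumes' rows are usually read as concept_id_1 subsumes concept_id_2 (the broader concept first), which would be the reverse of the direction assumed in the generated comment. The `parent_first` flag below is a hypothetical switch that shows how much the result depends on that assumption.

```python
from collections import defaultdict, deque

# Toy rows shaped like concept_relationship: (concept_id_1, relationship_id, concept_id_2).
# 4112853 and 4237178 come from the post; all other IDs are made up.
ROWS = [
    (4112853, "Subsumes", 9100001),
    (9100001, "Subsumes", 9100002),
    (4112853, "Subsumes", 9100003),
    (9100001, "Has morphology", 4237178),
    (9100002, "Has morphology", 4237178),
    (9100003, "Has morphology", 9100009),  # a different, made-up morphology
]


def descendants(root, rows, parent_first=True):
    """Breadth-first walk over 'Subsumes' edges, including the root itself (like *0..).

    parent_first=True reads each row as (parent, 'Subsumes', child);
    pass False if the graph was loaded the other way round.
    """
    children = defaultdict(set)
    for a, rel, b in rows:
        if rel == "Subsumes":
            parent, child = (a, b) if parent_first else (b, a)
            children[parent].add(child)
    seen, queue = {root}, deque([root])
    while queue:
        node = queue.popleft()
        for child in children[node] - seen:
            seen.add(child)
            queue.append(child)
    return seen


def graph_concept_set(root, morphology, rows, parent_first=True):
    """Descendants of root that also have a 'Has morphology' edge to the target morphology."""
    has_morphology = {a for a, rel, b in rows if rel == "Has morphology" and b == morphology}
    return sorted(descendants(root, rows, parent_first) & has_morphology)


print(graph_concept_set(4112853, 4237178, ROWS))                      # → [9100001, 9100002]
print(graph_concept_set(4112853, 4237178, ROWS, parent_first=False))  # → []
```

With the direction flipped, the traversal finds no descendants and the result is empty, so a query that silently walks the wrong way is easy to mistake for "no matching concepts".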
    

These examples suggest that, once validated against a real graph, LLMs can handle the complex, multi-hop logic required for advanced phenotyping without requiring the researcher to write the graph query manually. Of course, I supplied the concept IDs in these prompts and have not tested or challenged the model to pick concept IDs itself. Picking concept IDs is something others are currently trying to solve using semantic search, and it should be doable with named entity recognition and normalization (NER/NEN).

H3 (UI/UX Feasibility)

Hypothesis: It is feasible to develop a user interface that allows researchers to intuitively construct and visualize graph-based queries.

My preliminary research suggests this is feasible. There are several dedicated Visual Query Builders, Visual Graph Exploration Tools, and Foundational Libraries (like Cytoscape.js) for custom development. Foundational libraries could be integrated into OHDSI tools (e.g., Atlas) to build a UI/UX for complex concept set expressions. I am going to assume that these Visual Query Builders are capable of supporting the construction of sufficiently complex Cypher queries.

H4 (Infrastructure)

This will be the last item to prove and likely the most challenging aspect of adoption. Obviously, replacing concept_relationship and concept_ancestor with a graph database backend would be a significant breaking change to the OHDSI tool stack and infrastructure. This requires careful consideration.
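
One way to soften the breaking change might be to keep emitting concept_ancestor-style rows from the graph, so existing tools could keep reading the table they expect. The sketch below is a toy illustration of that idea, not a description of how concept_ancestor is actually built: it assumes simple parent-to-child edges with made-up IDs and no cycles, whereas the real table is derived from whichever relationships the vocabulary treats as hierarchical. The output columns follow the concept_ancestor layout (ancestor, descendant, min_levels_of_separation, max_levels_of_separation), including a self row at level 0 for each concept.

```python
from collections import defaultdict

# Made-up parent -> child edges (hypothetical IDs) forming a small DAG:
# 1 -> 2 -> 4 and 1 -> 3 -> 4, plus a direct shortcut 1 -> 4.
EDGES = [(1, 2), (1, 3), (2, 4), (3, 4), (1, 4)]


def derive_concept_ancestor(edges):
    """Return rows (ancestor, descendant, min_levels, max_levels) for an acyclic edge list."""
    children = defaultdict(set)
    nodes = set()
    for parent, child in edges:
        children[parent].add(child)
        nodes.update((parent, child))

    rows = []
    for ancestor in sorted(nodes):
        # levels[d] collects every distinct path length from ancestor to d.
        levels = defaultdict(set)
        levels[ancestor].add(0)
        frontier = [(ancestor, 0)]
        while frontier:
            node, depth = frontier.pop()
            for child in children[node]:
                # Skip (child, depth + 1) if it was already expanded via another path.
                if depth + 1 not in levels[child]:
                    levels[child].add(depth + 1)
                    frontier.append((child, depth + 1))
        for descendant in sorted(levels):
            rows.append((ancestor, descendant, min(levels[descendant]), max(levels[descendant])))
    return rows


for row in derive_concept_ancestor(EDGES):
    print(row)
```

Concept 4 is reachable from concept 1 both directly and through two-step paths, so its row is (1, 4, 1, 2): minimum separation 1, maximum separation 2.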