OHDSI Phenotype workgroup updates

Meeting Summary & Analysis

OHDSI Phenotype Development and Evaluation Workgroup. Date: July 25, 2025

Participants

  • Meeting Lead: Gowtham Rao
  • Contributor: Azza Shoaibi
  • Contributor: Christopher Mecoli
  • Contributor: Joel Swerdel
  • Contributor: Ben Hamlin
  • Contributor: Juan M Banda
  • Contributor: Jacqueline Honerlaw
  • Contributor: Tatsiana Skuhareuskaya

Executive Summary

The workgroup reconvened to celebrate a significant publication, pivot toward new research avenues, and align on administrative changes. The session was marked by a vibrant discussion on leveraging Large Language Models (LLMs) for phenotype development, sparking a debate on the appropriate evaluation methodology and how to balance AI-driven generation against human expertise and established community resources. The group agreed to formalize an experiment comparing human vs. LLM concept set creation. Key updates were also shared on the Book of OHDSI and the ongoing integration between the OHDSI and VA CIPHER phenotype libraries, reinforcing a push for greater interoperability and standardized metadata.

Topics Discussed

  1. Workgroup Administration & OKR Review
  2. Publication of Rheumatic Disease Phenotype Paper
  3. Proposed Experiment: LLM vs. Human-Generated Concept Sets
  4. Framework for Phenotype Evaluation
  5. Book of OHDSI & Phenotype Library Updates
  6. Cross-Library Collaboration & Future Directions
  7. Action Items & Next Steps

Detailed Topic Analysis

1. Workgroup Administration & OKR Review

Gowtham Rao opened the meeting, announcing a change in cadence to once a month, on the fourth Friday. He briefly reviewed the 2025 Objectives and Key Results (OKRs), noting progress on enhancing the science of phenotyping while acknowledging that some initiatives may need to be simplified or closed due to delivery challenges.

2. Publication of Rheumatic Disease Phenotype Paper

The group celebrated the recent publication of a paper on phenotyping a rare rheumatic disease, led by Dr. Christopher Mecoli. Dr. Mecoli described it as a “long process” that involved collaboration with multiple institutions and validated the output of the PheValuator tool through manual chart review. Gowtham Rao praised Dr. Mecoli’s end-to-end leadership as a clinician, highlighting his journey from learning the OHDSI tools to leading a team and publishing the work. The plan is to expand this validated methodology to other rheumatic and autoimmune diseases.

3. Proposed Experiment: LLM vs. Human-Generated Concept Sets

Joel Swerdel introduced a plan for a formal experiment to evaluate the use of LLMs in generating concept sets. The proposed methodology involves having one team create a concept set manually while another uses an LLM. A clinical adjudicator would then compare the two outputs to determine their accuracy and calculate performance statistics. Joel called for volunteers from the community, including clinicians to act as adjudicators and others to help develop the concept sets, suggesting the work could be a focus at the upcoming OHDSI symposium.

4. Framework for Phenotype Evaluation

Joel’s proposal sparked a broader discussion on the foundational principles of phenotype evaluation. Ben Hamlin argued for establishing a more explicit, abstract model framework within the Book of OHDSI’s Chapter 11. He felt that while the chapter contains excellent information, the core principles are “kind of buried.” Ben stressed the need for a clear, scientifically validated “gold standard” to ensure referential integrity, especially as the community moves toward using AI and expanding the phenotype library.

This led to a debate on the nuances of evaluation. Juan M Banda questioned the premise of a “fair” comparison between a human and an LLM, noting the LLM’s vast, embedded knowledge of literature versus a human’s practical experience. He cautioned that LLMs can “come up with a bunch of garbage” and suggested that not using the existing Phenotype Library as a baseline could erode trust in it. Chris Mecoli proposed taking the evaluation a step further with chart reviews to see which concept set—AI or human—better identifies the true clinical concept, though he acknowledged the significant effort required. Azza Shoaibi synthesized the discussion, noting that Joel’s LLM workflow is sophisticated, using OHDSI’s vocabulary ontology and the Phoebe tool to assess concept prevalence, making it more than a simple query.

Key Takeaway: The group converged on the idea that a more structured evaluation process is needed. While clinical adjudication remains the gold standard for concept sets, the experiment provides an opportunity to refine and formalize a multi-faceted evaluation framework that could incorporate tools like CohortDiagnostics and PheValuator alongside chart review.

5. Book of OHDSI & Phenotype Library Updates

Azza Shoaibi reported that the first draft of Chapter 11 of the Book of OHDSI is complete and open for community review. The content is a curation of ideas discussed over the past several years. She specifically requested that Juan M Banda review and update the section on probabilistic phenotyping. She and Ben Hamlin agreed to collaborate on creating a clearer visual diagram for the evaluation framework to be included in the chapter. Gowtham Rao also committed to finalizing the OHDSI Phenotype Library paper with Juan.

6. Cross-Library Collaboration & Future Directions

Jacqueline Honerlaw provided an update on the integration between the VA’s CIPHER phenotype library and the OHDSI library. The collaboration is expanding to include the PheKB and HDR UK phenotype libraries, which she believes will make for a more compelling joint publication. This work highlights the need for a common metadata standard across libraries.

Gowtham Rao affirmed that this aligns with the future direction for the OHDSI Phenotype Library, which needs a better user interface and clearer contribution pathways. He called for volunteers to form a subgroup to work on the next generation of the library.

7. Action Items & Next Steps

Azza Shoaibi summarized the key action items from the meeting:

  • LLM Experiment: Joel and Azza will draft a formal protocol for the LLM vs. human concept set experiment and share it with the workgroup to form a dedicated team.
  • Book of OHDSI:
    • Juan will update the probabilistic phenotyping section.
    • Azza and Ben will develop a new diagram for the evaluation framework.
    • All members are encouraged to review the chapter.
    • Sajjan will assist with formatting and citations.
  • Phenotype Libraries:
    • Gowtham will clean up and release the next batch of OHDSI cohort definitions for integration into CIPHER within two weeks.
    • Jacqueline will share a draft of the multi-library integration paper in early August.
    • Azza will work on integrating Phenotype Phebruary definitions into the library.
  • Meeting Schedule: The August 22nd meeting is canceled due to vacations. The group will explore meeting on September 5th or 12th.

This is a cross post. [CONFIRMED - HYRBID and IN PERSON] OHDSI Symposium 2025 - Phenotype Development and Evaluation Work Group - Call for Collaboration: Collaborative Intelligence – Humans and AI in Concept Set Development in Phenotyping - #14 by Gowtham_Rao

Today, 9:00 to 10:00 AM EST: OHDSI Phenotype Development and Evaluation WG meeting (monthly cadence, updated July 2025), held on Microsoft Teams.

Dear OHDSI Community, today is the last meeting prior to the OHDSI Global Symposium 2025. We have a lot to plan and we need your input. The logistics this year are overwhelming and our risk of failure is high; we will walk through the details. We also need to make concrete progress on clinical descriptions.

As detailed extensively on the OHDSI forums, the Phenotype Development and Evaluation Workgroup is organizing a structured scientific work session: “Collaborative Intelligence: Humans and AI in Concept Set Development in Phenotyping.” This “Minds Meet Machines” challenge is designed as a rigorous, hypothesis-driven experiment to formally evaluate the performance of Generative AI pipelines against expert manual curation using a multi-arm, blinded, randomized comparative design with crossover adjudication. The execution of this live, recorded experiment involves complex logistics, as outlined in the detailed “Run-of-Show.” Today’s planning meeting focused on finalizing the experimental protocol and addressing critical operational risks.

Refining Experimental Scope for Rigor

We discussed the scope of the experimental arms relative to the time allocated for Phase 1 (Human Baseline Generation). While the initial proposal identified up to 28 potential phenotypes, prior experience within the OHDSI community underscores the significant time required to develop high-quality concept sets and manage the consensus-building process (Modified Delphi method). To ensure methodological rigor and the clean, complete execution of the experiment within the strict timeframe of the workshop, we are confirming the decision to strictly limit the scope to one phenotype per table/expert lead. This focus on quality over quantity is essential for maintaining the validity of the human baseline.

Immediate Action Required: Finalizing Clinical Descriptions

For the confirmed Clinical Leads, we must now finalize the standardized inputs for the experiment. We require a concise, “fully specified” clinical description for your selected phenotype. These descriptions should be approximately two lines. To maximize the scientific relevance of this evaluation, we strongly prefer clinical ideas that are currently under investigation in an actively recruiting Phase 3 clinical trial, because these are of interest to the life science researchers (pharmaceutical companies) in our community. Finalizing these descriptions immediately is crucial for preparing the standardized input packets that will be provided simultaneously to the human and AI teams.

Technology Platform Update: Adjudication Tool

A critical dependency for the Phase 2 blinded adjudication is the software platform required to present the concept sets (human-generated and AI-generated) in a standardized, blinded format, allowing teams to assess the “deltas” and capture scores using the standardized rubric. We extend our sincere gratitude to @gdravida09, who is voluntarily building this crucial open-source OHDSI software application from scratch specifically for this experiment (GitHub: OHDSI/py_codelist_comparator, described as a Python app for the OHDSI “Minds Meet Machines” challenge with a high-performance data pipeline, a scalable adjudication interface for 100+ users, and a real-time visualization dashboard). Gaurav, thank you for undertaking this critical development effort. Could you please provide the workgroup with an update on the development progress? We are particularly interested in the platform’s readiness for the upcoming technical dry runs, which are vital for mitigating risks during the live data extraction and blinding window.

Critical Risk Assessment: AI Arm Recruitment

The most significant risk to the scientific integrity of this multi-arm study is insufficient participation in the “Machine” arm. A robust evaluation depends on comparing human expertise against multiple, distinct AI pipelines. We recently conducted an extensive search and issued an “Open Challenge to AI Innovators,” directly inviting leaders across academia, open-source projects, and industry. Despite this broad and direct outreach, we have not yet received the necessary commitments. It is unclear why the response has been limited, and we welcome the community’s perspective on this challenge. Are the experimental stipulations proving to be barriers, such as the requirement for autonomous execution without “human-in-the-loop” refinement and the need for methodology transparency (e.g., prompts, system settings)? Is there a reluctance by innovators to participate in a rigorous, blinded, head-to-head comparison focused strictly on scientific validity? Maybe real AI has nothing to fear and the rest is just hype, or maybe we should make participation blinded so that early startups do not feel vulnerable if their quality is scored lower. We reiterate the open call and urge the OHDSI community to help facilitate connections with potential AI collaborators. The success of this landmark evaluation relies on bringing these innovations together.

This initiative represents a significant effort to generate high-quality, peer-reviewable evidence regarding the integration of AI into the phenotyping lifecycle. We appreciate your immediate attention to these critical items.

Sincerely,

On behalf of the OHDSI Phenotype Development and Evaluation Workgroup Organizing Leads

(Gowtham A Rao MD, PhD; Azza A Shoaibi PhD; Joel Swerdel PhD; Jack Murphy MPH, PhD)

This is a cross post [CONFIRMED - HYRBID and IN PERSON] OHDSI Symposium 2025 - Phenotype Development and Evaluation Work Group - Call for Collaboration: Collaborative Intelligence – Humans and AI in Concept Set Development in Phenotyping - #14 by Gowtham_Rao

Planning discussion
The YouTube video is available on the Phenotype Development and Evaluation workgroup YouTube channel.

This is a detailed summary of the OHDSI Phenotype Development and Evaluation Work Group meeting held on September 26, 2025. The meeting focused on the urgent planning and logistical coordination for the “Minds Meet Machines” challenge, scheduled for the OHDSI Global Symposium on October 9, 2025.

Overview: The “Minds Meet Machines” Challenge

The workgroup is organizing a four-hour, live scientific experiment designed to rigorously compare the performance of AI (specifically Large Language Models - LLMs) against expert human curation in developing phenotype concept sets.

The experiment utilizes a multi-arm, blinded, randomized, controlled design. Human teams will generate concept sets live, while AI-generated concept sets will be produced prior to the symposium using the same standardized inputs. The outputs will then be blindly adjudicated.

Gowtham Rao opened the meeting by stating that planning is “close to being in the red area,” emphasizing the complexity of the undertaking and the need to accelerate execution with the symposium less than two weeks away.

Key Discussion Points, Decisions, and Risks

1. Experimental Design and Logistics

Structure: The event will feature several tables (estimated at four, based on confirmed clinical leads), each focusing on one specific clinical idea (e.g., Systemic Lupus Erythematosus led by Dr. @Christopher_Mecoli). There are approximately 48 in-person registrants, yielding an estimated 30-35 active participants.

Technology: A dedicated, stable ATLAS instance will be deployed. It will be active during the one-hour human generation phase and immediately locked (made read-only) afterward.

  • Decision: The ATLAS instance will use the latest version with PHOEBE and concept prevalence.
  • Action Item: The specific vocabulary version must be confirmed (with Patrick Ryan and Konstantin) to ensure consistency for both human and AI arms.

Participant Assignment: To ensure fairness and balance expertise, stratified randomization will be used.

  • Decision: Participants will be surveyed a week prior to rank their confidence and skills in informatics/tooling, allowing organizers to ensure a blend of senior and junior expertise at each table.
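The stratified assignment described above could be sketched roughly as follows. This is only an illustration under assumptions: the survey is reduced to a single hypothetical skill score per participant, and the function name, scores, and participant labels are invented for the example. Participants are grouped into strata by score, shuffled within each stratum, and dealt round-robin across tables so that each table receives a mix of experience levels.

```python
import random
from collections import defaultdict

def stratified_assign(participants, n_tables, seed=0):
    """Assign participants to tables, balancing self-rated skill.

    participants: list of (name, skill) tuples; skill is a hypothetical
    survey score where a higher value means more confidence.
    """
    rng = random.Random(seed)  # fixed seed keeps the allocation reproducible
    strata = defaultdict(list)
    for name, skill in participants:
        strata[skill].append(name)

    tables = [[] for _ in range(n_tables)]
    slot = 0
    # Walk strata from most to least experienced, shuffling within each one,
    # and deal members round-robin so no table collects a whole stratum.
    for skill in sorted(strata, reverse=True):
        members = strata[skill]
        rng.shuffle(members)
        for name in members:
            tables[slot % n_tables].append(name)
            slot += 1
    return tables

# Hypothetical survey results: (participant, skill score 1-3)
people = [("A", 3), ("B", 3), ("C", 2), ("D", 2),
          ("E", 1), ("F", 1), ("G", 1), ("H", 3)]
tables = stratified_assign(people, n_tables=2)
```

Fixing the seed makes the allocation auditable after the event; a real allocation would likely need further strata (for example clinical versus informatics background), which this sketch does not attempt.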

Data Capture: Human deliberations during both generation and adjudication will be audio-recorded (with consent) for qualitative analysis of decision-making processes.

2. AI Participation (Critical Risk)

A major concern is the low number of confirmed AI participants, which threatens the statistical power of the study.

  • Status: Only two AI workflows are confirmed: J&J (Joel Swerdel’s script) and EPAM (Prometheus, led by Daria Zhukova).
  • Mitigation: Gowtham Rao has conducted extensive outreach to academic and industry innovators. To address hesitation regarding public comparison and potential negative exposure, the option for blinded (anonymized) participation has been formally introduced, allowing teams to benchmark their tools confidentially.

3. Standardization and Inputs

Standardized Clinical Descriptions: These inputs must be finalized one week before the symposium.

  • Decision: Azza Shoaibi will propose a standardized input format, which must include the intended utility of the phenotype. These inputs must be signed off by the clinical leads in advance to ensure stability during the live session.
  • Action Item: A dedicated training/alignment call will be organized for Clinical Leads.

AI Generation Timing: AI concept sets will be generated before the event. This ensures fairness by decoupling algorithmic quality from computational infrastructure limitations (e.g., academic vs. industry resources).

4. Adjudication and Evaluation Methodology (Major Discussion)

The most debated topic was how to define the “Gold Standard” and measure performance.

Blinding and Bias: Adjudication uses a crossover design (participants evaluate sets they did not create). However, a potential bias was identified: Clinical Leads, having overseen the human generation, will inherently recognize that set during adjudication.

  • Mitigation: Clinical Leads must be strictly instructed to act only as neutral subject matter experts during adjudication and not advocate for any specific codelist.

The “Gold Standard” Debate: The group debated the validity of using a pre-defined “Target Standard” (created in advance by organizers Gowtham, Azza, and Patrick) as the ultimate benchmark, as this might bias results toward those experts’ views.

  • Jack Murphy’s Proposal (Target Standard as Benchmark): Jack proposed a rubric combining:

    1. F1 Score (Precision/Recall): Automatically calculated against the pre-defined Target Standard.
    2. Structural Proficiency Score: Human-adjudicated score evaluating the efficiency and parsimony of the concept set expression (e.g., appropriate use of hierarchies).
  • Azza Shoaibi’s Counter-Proposal (Adjudicated Gold Standard): Azza proposed a more rigorous approach where the Gold Standard is derived from the experiment. All concepts generated by Humans, AI, and the Target Standard would be pooled. Concepts agreed upon by all are accepted; all discrepancies (deltas) are adjudicated. The final, consensus-driven list becomes the “True Gold Standard,” against which all initial attempts are then measured.

  • Decision: The evaluation methodology remains unconfirmed. A dedicated meeting is required to finalize the approach.
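For reference, the F1 component of the proposed rubric can be illustrated with a short sketch. It assumes that each concept set has already been resolved to a flat set of included concept IDs and that the reference is whichever standard the group eventually adopts; the function name and the concept IDs below are made up for the example.

```python
def precision_recall_f1(candidate, reference):
    """Score a candidate concept set against a reference concept set."""
    candidate, reference = set(candidate), set(reference)
    true_pos = len(candidate & reference)
    # Precision: share of submitted concepts that are in the reference
    precision = true_pos / len(candidate) if candidate else 0.0
    # Recall: share of reference concepts that were submitted
    recall = true_pos / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical concept IDs, for illustration only
target_standard = {101, 102, 103, 104}
submitted = {101, 102, 105}
p, r, f1 = precision_recall_f1(submitted, target_standard)
```

In this toy case two of the three submitted concepts are in the reference (precision 2/3) and two of the four reference concepts were found (recall 1/2). The same function applies unchanged whether the reference is the pre-defined Target Standard or an adjudicated list; only the `reference` argument differs.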

Vocabulary Handling: The group discussed how to compare concept sets if AI tools submit non-standard vocabularies (e.g., ICD-10 only).

  • Decision: The evaluation tool (developed by Gaurav Dravida) will facilitate like-for-like comparisons within the submitted vocabulary (e.g., comparing AI ICD-10 codes only against Human ICD-10 codes).
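A like-for-like comparison of this kind might look like the sketch below. It is not the actual py_codelist_comparator implementation; it assumes each submission is a list of (vocabulary, code) pairs, and the function names and sample codes are illustrative. Codes are grouped by vocabulary and deltas are computed only for vocabularies that both sides submitted.

```python
from collections import defaultdict

def by_vocabulary(codes):
    """Group (vocabulary, code) pairs into one set of codes per vocabulary."""
    grouped = defaultdict(set)
    for vocab, code in codes:
        grouped[vocab].add(code)
    return grouped

def like_for_like(ai_codes, human_codes):
    """Compare two code lists only within vocabularies present in both."""
    ai, human = by_vocabulary(ai_codes), by_vocabulary(human_codes)
    result = {}
    for vocab in ai.keys() & human.keys():
        result[vocab] = {
            "shared": ai[vocab] & human[vocab],
            "ai_only": ai[vocab] - human[vocab],
            "human_only": human[vocab] - ai[vocab],
        }
    return result

# Illustrative submissions: the AI arm returned ICD-10-CM codes only
ai_codes = [("ICD10CM", "M32.10"), ("ICD10CM", "M32.9")]
human_codes = [("ICD10CM", "M32.10"), ("ICD10CM", "M32.8"),
               ("SNOMED", "55464009")]
deltas = like_for_like(ai_codes, human_codes)
```

Here the SNOMED code is left out of the comparison because the AI submission contains no SNOMED codes, so only the ICD-10-CM deltas would go to adjudication.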

Run of Show (October 9th)

The four-hour session is tightly scheduled, detailed in the OHDSI forum post:

  • 8:00-8:45 AM: Introduction and Standardization Training (including critical ATLAS naming conventions).
  • 8:45-9:45 AM: Phase 1: Human Baseline Generation (Live).
  • 9:45-10:30 AM: Transition & AI Overview (“The Critical Window”). While AI leads present their methodologies, the backend team executes the high-risk task of extracting, blinding, and loading the human-generated data into the adjudication tool.
  • 10:30-11:45 AM: Phase 2: Blinded Comparative Adjudication (Crossover design).
  • 11:45 AM-12:00 PM: The Reveal (Unblinding and preliminary results).
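The blinding step in the critical window can be pictured with a minimal sketch, assuming the concept sets arrive as a mapping from source name to concept IDs. The function name, labels, and data are hypothetical, and the sketch supports at most 26 sets. Sources are shuffled and relabeled, and the unblinding key is kept separately until the reveal.

```python
import random
import string

def blind(concept_sets, seed=None):
    """Replace source names with neutral labels; return blinded sets and key."""
    rng = random.Random(seed)
    sources = list(concept_sets)
    rng.shuffle(sources)  # label order must not reveal the submission order
    labels = [f"Set {string.ascii_uppercase[i]}" for i in range(len(sources))]
    blinded = {label: concept_sets[src] for label, src in zip(labels, sources)}
    key = dict(zip(labels, sources))  # stored apart from the adjudication tool
    return blinded, key

# Hypothetical concept sets for one clinical idea
sets = {"human": {1, 2}, "ai_a": {2, 3}, "ai_b": {1}}
blinded, key = blind(sets, seed=42)
```

Adjudicators would see only `blinded`; `key` is what gets opened at the unblinding step.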

Action Items

  1. Finalize Evaluation Methodology: Azza to schedule a dedicated meeting (including Jack Murphy, Martijn, and Patrick Ryan) to finalize the Gold Standard definition and adjudication process.
  2. Standardized Inputs: Azza to circulate the template for standardized clinical descriptions.
  3. Clinical Lead Coordination: Gowtham to organize the training/alignment call with Clinical Leads and confirm any outstanding leads.
  4. Vocabulary Version: Gowtham/Azza to confirm the specific vocabulary version for the ATLAS instance with Patrick Ryan.
  5. Participant Stratification: Organizers to survey participants on their confidence/skills for stratified randomization.
  6. AI Recruitment: Continue aggressive outreach for AI participants, emphasizing the anonymous option.
  7. Technical Development: Gaurav and the backend team (Sajan) to continue developing the adjudication tool and data pipeline, adapting to the finalized evaluation methodology.
  8. Target Standard Creation: Gowtham, Azza, and Patrick to create the “Target Standard” concept sets next week.

Mind Meets Machine YouTube Video

Part 1: https://youtu.be/igTQC4PkiCA

Part 2: https://youtu.be/7Ek3vF3Pu_E

Part 3: https://youtu.be/24FnC9FbaQU

Title: Mind Meets Machine Workshop Recap: A Scientific Evaluation of AI vs. Human-Led Concept Set Generation

The Phenotype Development and Evaluation Work Group convened a workshop, “Mind Meets Machine,” during the OHDSI 2025 Symposium. The session, co-led by @Azza_Shoaibi and @Gowtham_Rao, executed an informal exercise designed to address a critical question facing the OHDSI community: How do emerging Generative AI/LLM approaches for concept set generation compare to established human-led workflows?

This workshop directly supports the working group’s mission “to improve the quality and the reliability of the evidence we generate from observational data by advancing the science of phenotype development.” The goal was to scientifically evaluate the accuracy, completeness, and precision of these new tools before they are adopted into standard observational research processes.

The session began with a moving tribute to the late Jamie Weaver, honoring his significant contributions to the science of phenotyping and measurement error, setting a mission-driven tone for the day’s activities.

The Experiment: Design and Execution

The primary objective of the experiment, operating under an OHDSI QI project, was to compare the performance of Gen AI workflows against rigorous, consensus-based human workflows. The primary metric for evaluation will be the prevalence-weighted F-score.
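The exact definition of the prevalence-weighted F-score is not spelled out here. One plausible reading, shown purely as an assumption, is that each concept contributes its record count rather than a count of one, so that missing a frequently used code costs more than missing a rare one. The function name, concept IDs, and record counts below are invented for illustration.

```python
def weighted_f_score(candidate, gold, record_counts, beta=1.0):
    """F-score where each concept is weighted by its record count.

    record_counts: mapping of concept ID -> number of records
    (an assumed weighting; the study's own definition may differ).
    """
    candidate, gold = set(candidate), set(gold)

    def weight(ids):
        return sum(record_counts.get(c, 0) for c in ids)

    tp = weight(candidate & gold)   # weighted true positives
    fp = weight(candidate - gold)   # weighted false positives
    fn = weight(gold - candidate)   # weighted false negatives
    denom = (1 + beta**2) * tp + beta**2 * fn + fp
    return (1 + beta**2) * tp / denom if denom else 0.0

# Hypothetical record counts per concept ID
counts = {1: 900, 2: 90, 3: 10, 4: 50}
gold = {1, 2, 3}
candidate = {1, 3, 4}
score = weighted_f_score(candidate, gold, counts)
```

In this toy example the unweighted F1 would be 2/3, while the weighted score is 1820/1960 (about 0.93), because the candidate captured the concept that carries most of the records.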

The experiment involved several key phases:

  1. Human Workflow: Over 20 participants were randomized to different clinical ideas (e.g., SLE, DME, DVT, RA). They were given a strict 30-minute time limit to generate concept sets within a dedicated Atlas instance, based on provided clinical descriptions.
  2. AI Workflow: Four distinct Gen AI-driven methodologies, developed by community researchers, were submitted for the comparison.
  3. Gold Standard Creation: Recognizing that neither human consensus nor AI output constitutes a definitive ground truth, a “dynamically created gold standard” was established. Clinical experts for each disease area adjudicated codes where the human teams and AI workflows disagreed. The final gold standard was defined as the intersection of all human and AI codes, plus any disagreed codes validated by the adjudicators.
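The gold standard rule in step 3 can be expressed compactly. The sketch below assumes each workflow's output is a set of concept IDs and that adjudication is available as a yes/no function per disputed concept; the names and IDs are hypothetical.

```python
def build_gold_standard(concept_sets, adjudicate):
    """Combine unanimous concepts with adjudicator-validated disputed ones.

    concept_sets: mapping of workflow name -> set of concept IDs
    adjudicate: callable returning True if a disputed concept is valid
    """
    sets = [set(s) for s in concept_sets.values()]
    agreed = set.intersection(*sets)      # concepts every workflow included
    disputed = set.union(*sets) - agreed  # concepts at least one workflow omitted
    validated = {c for c in disputed if adjudicate(c)}
    return agreed | validated, disputed

# Hypothetical outputs for one clinical idea
concept_sets = {
    "human": {1, 2, 3},
    "ai_a": {1, 2, 4},
    "ai_b": {1, 2, 3, 5},
}
# Stand-in for the clinical expert's decisions on disputed concepts
gold, disputed = build_gold_standard(concept_sets, adjudicate=lambda c: c == 3)
```

Only the `disputed` concepts need expert review, which keeps the adjudication workload proportional to the disagreement between workflows rather than to the total number of concepts.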

Showcase: Diverse AI Methodologies

A key component of the workshop was a showcase of three distinct AI approaches submitted for the evaluation, demonstrating the diversity of methodologies being explored in the community:

  • EPAM Systems (Presented by @darya.zhukova): This approach utilizes a containerized architecture with a vector store for semantic searching (using AWS Bedrock). A key feature is user-controlled precision, allowing users to adjust the breadth of the search from exact matches to broad vector similarities, offering flexibility in balancing sensitivity and specificity.
  • King’s College (Presented by @Niko_Moller-Grell): This research-focused “agentic workflow” breaks down the complex process of concept reasoning into smaller sub-tasks. It employs a hybrid approach, combining semantic similarity (vector search) with ontological reasoning by traversing the OMOP vocabulary relationships (knowledge graph) for finer-grained decision-making and sanity checking.
  • JNJ (Presented by Joel Swerdel): This open-source tool utilizes a unique two-stage process. First, it leverages the PHOEBE recommender system (along with descendants of a starting concept) to generate a broad list of candidates. Second, an LLM adjudicates each candidate concept using specific “proportional logic” (e.g., asking if >95% of patients with the candidate concept also have the target condition).
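To make the two-stage idea with “proportional logic” more concrete, here is a rough sketch. It is not the actual tool: the LLM call is replaced by a plain function that returns an estimated proportion, and the function name, concept names, and estimates are all invented for illustration.

```python
def two_stage_concept_set(descendants, recommended, estimate_proportion,
                          threshold=0.95):
    """Sketch of candidate generation followed by proportional adjudication.

    estimate_proportion: stand-in for the LLM step; returns the estimated
    share of patients with the candidate concept who have the target condition.
    """
    # Stage 1: broad candidate list = descendants of the starting concept
    # plus recommender suggestions (de-duplicated, order preserved)
    candidates = list(dict.fromkeys([*descendants, *recommended]))
    # Stage 2: keep a candidate only if the estimated share exceeds threshold
    return [c for c in candidates if estimate_proportion(c) > threshold]

# Made-up estimates standing in for LLM responses
fake_estimates = {
    "Rheumatoid arthritis": 1.0,
    "Seropositive rheumatoid arthritis": 0.99,
    "Inflammatory polyarthropathy": 0.40,
    "Joint pain": 0.05,
}
result = two_stage_concept_set(
    descendants=["Rheumatoid arthritis", "Seropositive rheumatoid arthritis"],
    recommended=["Inflammatory polyarthropathy", "Joint pain",
                 "Rheumatoid arthritis"],
    estimate_proportion=fake_estimates.get,
)
```

Separating generation from adjudication means the threshold can be tuned toward sensitivity or specificity without regenerating the candidate list.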

Key Reflections and Scientific Challenges

Following the concept set generation and adjudication activities, the working group reflected on the process and preliminary observations, highlighting several fundamental challenges in the science of phenotyping:

1. Significant Variability and the Challenge of Specificity
Initial findings revealed stark differences depending on the clinical idea. For Rheumatoid Arthritis (RA), there was a surprising 94% consistency between human and machine-generated codes. In contrast, Deep Vein Thrombosis (DVT) showed less than 24% overlap.

Workshop attendees noted that the specificity of the study question dramatically impacted the task. Defining “acute proximal DVT” proved far more difficult than chronic RA, as available clinical codes often lack the necessary granularity. This creates a tension between adhering to a narrow clinical specification and the reality that significant record counts often reside on more general, ambiguous codes.

2. The “Source Problem”: Clinical Practice vs. Research Needs
A major theme of the discussion was the fundamental disconnect between how data is generated in clinical practice and the needs of observational research. Clinicians emphasized that coding in practice is driven primarily by billing and clinical operations, not research precision. This systemic gap means the raw material for phenotyping (diagnosis codes) is often not generated with research-grade precision, complicating all downstream efforts.

3. The Importance of Context and Data
The group emphasized that concept set creation cannot be divorced from the broader study design and the underlying data. Workshop attendees expressed the need to see record counts to determine the relevance of esoteric codes and noted that strategies for inclusion/exclusion depend heavily on the intended cohort logic (e.g., looking for DME in an already defined diabetic population allowed for a more inclusive approach).

Next Steps and Future Vision

The immediate next step is the formal analysis of the exercise, calculating the F-scores for each human team and AI workflow against the adjudicated gold standard. The results will be shared with the community.

Looking ahead, the working group stressed that this exercise is a stepping stone. The community must move beyond evaluating concept sets in isolation. As @Azza_Shoaibi and @JudyRac (Dr. Judith Racoosin) noted, OHDSI does not view phenotypes merely as code lists. Conceptual debates without data are often unproductive, and the ultimate validation requires running different design choices in the data to see if they affect the final patient cohort.

The goal for Phenotype Phebruary 2026 is to advance this work by having both humans and AI build complex, data-driven cohort definitions for meaningful clinical ideas, which can then be robustly evaluated across the OHDSI data network.


We extend our gratitude to the organizing team, the AI development teams, and all workshop participants for their contributions to this vital research, to Will Kelly (JHU, Johns Hopkins University) for logistical support, and to the clinical adjudicators: Dr. @Evan_Minty (DVT), Dr. @Christopher_Mecoli (Lupus/Systemic Sclerosis, Johns Hopkins University), Dr. @cindyxcai (DME, Johns Hopkins University), Dr. @briantoy (Posterior Uveitis, University of Southern California, USC Schaeffer Institute), and Dr. @Liz_Park (Rheumatoid Arthritis, Columbia University).

Workgroup updates

Minds Meet Machines: A Comprehensive Report on the OHDSI Phenotyping Challenge

The “Minds Meet Machines” event saw high engagement, bringing together over 50 in-person participants and approximately 50 online attendees, including clinical experts, researchers, and informaticians.

The “Minds Meet Machines” challenge is now moving into the formal analysis and dissemination phase.

Current Status (October 24th 2025):

Validation Plan:
The analysis code is undergoing a rigorous, multi-stage validation process before the final results are generated; see “Add Full MMM Challenge Analysis Pipeline (Phase 1–2) Implementing SAP” by ghatesudi (Pull Request #7, ohdsi-studies/MindMeetsMachines on GitHub).

Timeline and Future Goals:

  • The study team is targeting the end of October 2025 to produce a draft of the scientific paper detailing the study’s methodology, comprehensive findings, and implications for the OHDSI community.
  • The findings from this challenge will be used to inform the design of a more advanced challenge for Phenotype Phebruary 2026. This future initiative may expand the scope of the evaluation from generating concept sets to developing full, executable cohort definitions.

The Phenotype Development and Evaluation Workgroup 2026 OKRs:

Objective 1: Advance the Science of AI-Assisted Systematic Phenotyping (shared with AI work group)
KR 1.1: Conduct Phenotype Phebruary with the objective of helping benchmark an iterative, empirically grounded, AI-assisted workflow across diverse RWD network sources by Q1 2026.
KR 1.2: Finalize and submit the “Minds Meet Machines” manuscript to a high-impact informatics journal by Q1 2026.
KR 1.3: Develop a gold standard for phenotype algorithms for a specific data source that can be used to evaluate phenotype development or evaluation pipelines by Q3 2026.
KR 1.4: Develop, test, and validate a robust AI-assisted pipeline for the development of clinical phenotypes by September 2026.
KR 1.5: Populate the Phenotype Library with >= 100 new phenotypes using AI and demo the learnings at the Global Symposium.

Objective 2: OHDSI Library integration into CIPHER

KR 2.1: Complete integration into CIPHER by Q3 2026.

KR 2.2: Publish a peer-reviewed communication establishing Inter-Library Metadata Standards (led by VA CIPHER) to ensure high-fidelity findability and cross-network reproducibility by Q2 2026.

KR 2.3: Refactor the Phenotype Library Submission Architecture to mandate adherence to the new metadata ontologies by Q4 2026.


OHDSI Phenotype Development and Evaluation Workgroup meeting held on January 29, 2026.


1. Administrative Updates

  • New Cadence: The workgroup has officially moved its meetings to the last Thursday of every month (9:00–10:00 AM ET). This change was made to better accommodate global collaborators.
  • New Members: The group welcomed newcomers from the Netherlands, Indonesia, and India, reflecting the community’s expanding global reach.

2. 2026 Objectives and Key Results (OKRs)

The workgroup finalized a roadmap for 2026, pivoting heavily toward machine-assisted methodologies.


3. Phenotype Phebruary 2026 Workflow

The group detailed a live experiment starting in February to test the new iterative workflow.

  • Iteration and Feedback: Participants will submit phenotype definitions for a community-voted condition. These will be evaluated using Keeper, an LLM-enabled tool that abstracts patient data for adjudication.
  • Error Metrics: Submitters will receive performance metrics (e.g., PPV and sensitivity) and patient profiles to refine their definitions in a “human-machine feedback loop”.
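As a reminder of what these error metrics measure at the patient level, here is a small sketch. It assumes a set of adjudicated true cases is available for the data source (for example from Keeper-assisted review); the function name and patient IDs are made up.

```python
def ppv_and_sensitivity(cohort_ids, true_case_ids):
    """Patient-level PPV and sensitivity of a cohort definition."""
    cohort, truth = set(cohort_ids), set(true_case_ids)
    true_pos = len(cohort & truth)
    # PPV: share of cohort members who are adjudicated true cases
    ppv = true_pos / len(cohort) if cohort else 0.0
    # Sensitivity: share of adjudicated true cases captured by the cohort
    sensitivity = true_pos / len(truth) if truth else 0.0
    return ppv, sensitivity

# Hypothetical patient IDs
cohort = {1, 2, 3, 4, 5}      # patients selected by the phenotype definition
true_cases = {2, 3, 4, 6}     # patients adjudicated as having the condition
ppv, sensitivity = ppv_and_sensitivity(cohort, true_cases)
```

In this example three of five cohort members are true cases (PPV 0.6) and three of four true cases are captured (sensitivity 0.75). Patient 6 is the kind of missed case whose profile a submitter would inspect when refining the definition.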

4. The “Gold Standard” Debate

A significant discussion occurred regarding what constitutes a “gold standard” for evaluation.

  • Rubric vs. Patient Set: Members debated whether the gold standard should be a set of validated patients (similar to chart reviews) or a performance rubric.
  • PIQI Framework: Ben Hamlin introduced the PIQI (Patient Information Quality Improvement) framework, recently adopted by the VA, as a potential model for standardizing these performance expectations.
  • Context Sensitivity: Sima Mohammadi noted that gold standards are often data-source specific, meaning a definition that works in primary care might not be the gold standard for claims data.

5. Follow-Up Action Items

| Task | Assignee | Deadline |
| --- | --- | --- |
| Phenotype Phebruary Voting: Launch poll for condition selection (e.g., MI vs. Rheumatology). | Azza/Gowtham | February 1 |
| OKR Submission: Finalize reworded OKRs and submit to Steering Committee. | Azza/Gowtham | Immediate |
| Manuscript Review: Participants to review and add co-author names to the “Minds Meet Machine” draft. | All Contributors | Two Weeks |
| March Planning: Prepare a session dedicated to the “Gold Standard” definition. | Workgroup Leads | March Call |

Phenotype Development and Evaluation Workgroup - Feb 26, 2026 Meeting Recap

Executive Summary
The February 26th meeting of the Phenotype Development and Evaluation Workgroup focused on finalizing the ambitious 2026 OKRs, navigating recent infrastructure outages, and planning the upcoming AI-assisted Phenotype Challenge (Phenotype Phebruary - aPHril). Key scientific discussions centered on the successful “Minds Meet Machine” study results and the critical, ongoing debate on how to establish a definitive gold standard for phenotype algorithm evaluation.


1. 2026 Objectives and Key Results (OKRs)

The workgroup has officially aligned on two primary objectives for the year:
  • Objective 1: Advance the Science of AI-Assisted Systematic Phenotyping, a strategic goal shared directly with the AI Workgroup.
  • Objective 2: Ensure robust OHDSI Library integration.

2. “Minds Meet Machine” Publication

  • Scientific Results: The study’s empirical results have been finalized and pushed to the public repository. Notably, one of the tested AI workflows (utilizing PHOEBE alongside probabilistic LLM prompting) successfully demonstrated statistical non-inferiority compared to human clinical coders across all six tested phenotypes.
  • Publication Goal: The primary objective is to finalize and submit the manuscript to a high-impact informatics journal by Q1 2026.
  • Logistics: Due to recent collaborative environment outages that temporarily restricted document access, the clean, fully edited manuscript will bypass the shared servers and be distributed directly via email for final institutional approvals.

3. Establishing a Gold Standard for Evaluation

  • The Goal: Develop a gold standard for phenotype algorithms for a specific data source by Q3 2026. This is a prerequisite to reliably evaluate independent phenotype development or evaluation pipelines.

  • Scientific Debate: A significant discussion focused on whether this standard should rely on traditional human chart reviews via direct EHR access, or if the network should leverage AI-extraction tools (like KEEPER) combined with LLM adjudication.
  • Data Privacy & Compute Constraints: Sharing patient-level data with external commercial LLMs introduces massive legal and compliance friction. To mitigate this, the working strategy leans toward standardizing a federated prompt workflow so data partners can run open-source models securely behind their own institutional firewalls. @rkboyce

4. Phenotype Challenge (Shifted to April)

  • The Objective: Conduct the annual Phenotype Phebruary to help benchmark an iterative, empirically grounded, AI-assisted workflow collaboratively across diverse RWD network sources by Q1 2026. The timeline has been shifted to April to accommodate complex technical integrations. @Azza_Shoaibi

  • The Workflow: The community will submit candidate phenotype definitions. These will be evaluated locally using extraction tools and LLM agents to estimate Positive Predictive Value (PPV) and sensitivity. Participants will receive targeted feedback on false positives/negatives, allowing them to iterate and improve their algorithms.
  • Long-Term Impact: This challenge is a stepping stone to develop, test, and validate a robust AI-assisted pipeline for clinical phenotypes by September 2026. Ultimately, the goal is to populate the Phenotype Library with 100 or more new phenotypes using AI and demonstrate these learnings at the global symposium.

5. OHDSI Library Integration Updates

  • Metadata Standardization: The group is on track to publish a peer-reviewed communication establishing Inter-Library Metadata Standards by Q2 2026. Led by VA Cipher, this aims to ensure high-fidelity findability and cross-network reproducibility. @jhonerlaw
  • Architecture Refactoring: By Q4 2026, the Phenotype Library Submission Architecture will be fully refactored to explicitly mandate adherence to these new metadata ontologies.


Action Items & Next Steps

  • Manuscript Distribution: Route the finalized “Minds Meet Machine” paper to all co-authors via direct email to clear final institutional approvals.
  • Next Meeting Agenda: April Phenotype Challenge.

Slides 20260326.pptx

OHDSI Phenotype Workgroup Meeting Agenda
Date: March 26, 2026

1. Publications & Frameworks

  • Mind Meets Machine Update: Status check on the final manuscript distribution, institutional approvals, and target timeline for journal submission. @Azza_Shoaibi
  • Metadata Framework Panel Submission: Update on the panel proposal status and next steps for the framework communication. @jhonerlaw

2. Phenotype Library & Tooling Development

  • New Phenotype Library Release: Overview and walkthrough of the latest library updates and additions. @sangitha_bhat
  • Cipher Extraction Update: Progress report on the integration and data extraction workflows. @Gowtham_Rao
  • Librarian Agent Showcase: Update on the LLM “librarian agent” operating on top of the phenotype library. @rkboyce

3. Phenotype Challenge Planning

  • Phenotype Phebruary – aPHril Edition:
    • Gold Standard Methodology: Finalizing the approach for KEEPER/LLM-based adjudication vs. traditional chart reviews. @Azza_Shoaibi
    • Clinical Use Case: Confirming the target condition (e.g., Acute Myocardial Infarction) and securing our clinical expert partner. @Gowtham_Rao
    • Launch Logistics: Final prep for opening the community challenge next month. @Gowtham_Rao @Azza_Shoaibi

Phenotype Development and Evaluation Workgroup - March 26, 2026 Meeting Recap

Leads: Gowtham A. Rao MD, PhD; Azza A. Shoaibi, PhD

Video https://youtu.be/ri-e8cTyJXQ

1. 2026 Objectives and Key Results (OKRs)

The workgroup reviewed the strategic roadmap for the year, focusing on two primary objectives:

Objective 1: Advance the Science of AI-Assisted Systematic Phenotyping (Shared with AI Workgroup)

  • KR 1.1: Benchmark an iterative, empirically grounded, AI-assisted workflow across diverse RWD network sources (Phenotype Phebruary/aPHril) by Q1 2026.
  • KR 1.2: Finalize and submit the “Minds Meet Machines” manuscript to a high-impact informatics journal by Q1 2026.
  • KR 1.3: Develop a gold standard for phenotype algorithms for a specific data source to evaluate independent pipelines by Q3 2026.
  • KR 1.4: Develop, test, and validate a robust AI-assisted pipeline for phenotype development by September 2026.
  • KR 1.5: Populate the Phenotype Library with ≥ 100 new phenotypes using AI and demonstrate learnings at the global symposium.

Objective 2: OHDSI Library Integration

  • KR 2.1: Publish a peer-reviewed communication establishing Inter-Library Metadata Standards (led by VA Cipher) to ensure findability and reproducibility by Q2 2026.

  • KR 2.2: Refactor the Phenotype Library Submission Architecture to mandate adherence to new metadata ontologies by Q4 2026.

2. Publications & Frameworks

  • “Mind Meets Machine” Manuscript: Content is officially complete following the addition of the final AI arm descriptions. The goal remains submission to a high-impact journal by the end of Q1 2026. @Azza_Shoaibi

  • Cipher & VUMC Integration: Implementation of the metadata framework is being targeted for Cipher and Vanderbilt (VUMC) to ensure cross-network reproducibility. A draft for this implementation paper is expected by early summer. @jhonerlaw

  • Metadata Framework: A panel proposal has been submitted for the November AMIA symposium. Work continues on the foundational paper for library harmonization, currently in “minor revisions” at JAMIA Open. @Honerlaw_Jacqueline

3. Phenotype Library & Tooling Development

  • New Library Release (v3.37.0): Sangitha Bhat walked the group through the first major refresh since April 2025. This includes Rupa Makadia’s validated pregnancy algorithms (Miscarriage, Ectopic Pregnancy, Stillbirth, Live Birth) defining clinical “eras.” @Sangitha_bhat @Rupa_Makadia

  • Cipher Extraction & Integration: Progress was reported on the automated extraction of OHDSI definitions for the Cipher library, improving global asset discoverability. @Gowtham_Rao

  • Librarian Agent Showcase: Richard Boyce demonstrated an LLM-based “Study Agent” using the Model Context Protocol (MCP) to retrieve and recommend phenotypes based on specific study intents. @rkboyce

4. Phenotype ‘aPHril’ Challenge Prep

  • The Objective: Benchmarking the AI-assisted workflow collaboratively across diverse RWD sources. Launch is set for March 31st.

  • Gold Standard Methodology: Significant debate continues regarding LLM adjudication vs. traditional human chart review. The group is leaning toward an AI-extraction (KEEPER) combined with LLM adjudication approach to handle scale. @Azza_Shoaibi

  • Clinical Use Case: Acute Myocardial Infarction (AMI) is confirmed as the target condition, utilizing existing definitions from Legend and COVID-AESI as benchmarks. @Gowtham_Rao

  • Launch Logistics: Final preparations for the community challenge are complete, with a four-week workflow beginning next week.

5. Oncology Data Partnership

  • New Collaboration: A new collaboration with Daniel Smith (Emory/Winship Cancer Institute) was introduced. The data source has deep oncology data for testing rule-based phenotyping, and there is interest in agentic phenotyping using the latest oncology extensions.

Action Items & Next Steps

  • Manuscript Submission: Azza Shoaibi to finalize the “Mind Meets Machine” submission by the Q1 deadline.

  • Challenge Launch: Gowtham Rao and Azza Shoaibi to open the Phenotype ‘aPHril’ challenge during the community call on March 31.

  • Metadata Poll: Jackie Honerlaw to distribute a poll regarding terminology disagreements in the metadata framework.

  • MCP Registry: Developers to coordinate with Richard Boyce on registering new MCP tools for the Librarian Agent.

Good morning - We unfortunately need to cancel today’s meeting. Both @Azza Shoaibi and I have conflicts. Please feel free to use the MS Teams channel for any updates.

Soon we will start planning for this year’s Global Symposium! Last year we did the Mind Meets Machine, which has a paper in circulation. The paper is currently here: JAMIA_Original_Article_MindMeetMachine_051526.docx. If you were at the MMM, please contact me and Azza directly, as we are finalizing authors. Authors.xlsx

Gowtham Rao MD, PhD