Phenotype Phebruary 2026

Dear OHDSI Community,

It’s finally arrived. That wonderful month when you can put all your troubles aside, cast off those New Year’s resolutions you’ve already failed at, enjoy the freezing cold Northeastern US weather or the Australian heat, and JUST FOCUS ON PHENOTYPING!

We are excited to launch Phenotype Phebruary 2026: another ambitious community experiment designed to advance how we collectively build, evaluate, and refine phenotypes at scale.

28 days. ONE phenotype. An end‑to‑end, iterative development and evaluation challenge.

That’s our target for OHDSI in 2026. Are you ready?

:microscope: What’s New This Year

Over the last decade, OHDSI has built a tool-enabled, structured workflow to support phenotype development and evaluation: Atlas, CohortDiagnostics, KEEPER, and PheValuator. Then, at last year's OHDSI Symposium, this workflow was challenged by AI in the Minds Meet Machines* experiment!

And because this is the era of reasoning models, and LLMs have shown promise to transform phenotype development, this year we're introducing The Phenotype Challenge: a collaborative experiment to test an iterative, empirically grounded, AI-assisted workflow.

We’ll explore whether we, as a community, can develop phenotypes through a cycle of:

  • Development
  • Evaluation
  • Error analysis
  • Refinement
  • Re‑evaluation

All with help from LLM‑enhanced KEEPER, multiple data sources, and robust diagnostics.
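
This cycle can be sketched as a simple loop. The following is a toy illustration only; all function bodies are hypothetical stand-ins, not OHDSI tool APIs, and the PPV numbers are made up:

```python
# Toy sketch of the development -> evaluation -> error analysis ->
# refinement -> re-evaluation cycle. All functions are illustrative
# stand-ins, not real OHDSI tooling.

def develop(clinical_idea):
    # Development: start from a deliberately broad rule set.
    return {"idea": clinical_idea, "rules": ["broad diagnosis code"]}

def evaluate(definition):
    # Stand-in for KEEPER/PheValuator: here, more rules -> higher PPV.
    return {"ppv": round(min(0.5 + 0.1 * len(definition["rules"]), 0.9), 2)}

def refine(definition):
    # Stand-in for error-driven refinement: add an inclusion rule.
    definition["rules"].append(f"inclusion rule {len(definition['rules'])}")
    return definition

def phenotype_cycle(clinical_idea, target_ppv=0.8, max_iterations=5):
    definition = develop(clinical_idea)          # Development
    for _ in range(max_iterations):
        metrics = evaluate(definition)           # Evaluation
        if metrics["ppv"] >= target_ppv:         # Error analysis
            break
        definition = refine(definition)          # Refinement
    return definition, evaluate(definition)      # Re-evaluation

definition, metrics = phenotype_cycle("myocardial infarction")
print(metrics["ppv"])  # -> 0.8 after a few refinements
```

The point is the control flow, not the stand-in metrics: each pass through the loop is one Phenotype Phebruary iteration.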

:checkered_flag: Challenge Workflow

1. Vote on the Condition

We will begin with a community vote between two candidate conditions. The winning condition will be announced immediately after the poll closes.

2. Submit Your Initial Phenotype (Feb 1–13)

  • All collaborators are invited to submit their best phenotype definition(s) using any method (rule-based, ML-based, or hybrid), so long as it conforms to the OMOP CDM. You may submit more than one definition if needed (e.g., one specific and one sensitive).
  • Deadline: Friday, February 13.

3. Evaluation & Diagnostics (Feb 12–24)

We will evaluate all submitted definitions using:

  • LLM-enabled KEEPER to estimate PPV and sensitivity (PheValuator results will also be shared if possible)
  • At least three observational data sources (Optum DOD, Optum EHR, JMD)
  • A CohortDiagnostics app comparing the implementations across the same data sources

Collaborators will have access to diagnostics and profiles to analyze performance and identify error sources.

4. Iteration #1 — Second Submission Due Feb 25

Submit your refined definition by Wednesday, February 25.
We will re‑evaluate all updated submissions with the same process.

5. Iteration #2 — Final Submission Due Mar 3

Submit your final definition by Tuesday, March 3.
Final performance results will be announced mid‑March, including:

  • Best overall performance
  • Most improved
  • Most generalizable

:people_holding_hands: Working Group & Community Calls

Phenotype WG (Tuesday 9am EST)

  • Feb 3: Support for LLM-enabled literature review of existing phenotypes, phenotype development in ATLAS, and GenAI concept set generation (based on learnings from Minds Meet Machines)
  • Feb 10: Discussion of evaluation results, diagnostics, and iteration strategies. Clinical partners' and collaborators' participation will be especially valuable in this session.
  • Feb 17 & Feb 24: Optional office hours for iteration support

OHDSI Community Calls (Tuesdays)

  • Feb 17: Update on submissions and early insights
  • Feb 24: How KEEPER profiles reveal phenotype error sources
  • Mar 3: Wrap‑up and reflections

:page_facing_up: Submission Requirements

Please include:

  • An OMOP‑CDM‑compliant phenotype definition (ATLAS JSON preferred)
  • A high-level description of your development method (any approach allowed, including ML)
  • A definition able to run on the participating data partners' platforms

Note: Data partners will run KEEPER locally. No patient profiles will be shared.

:hospital: Data Partners: We Need You

If you operate an OMOP CDM instance, we encourage you to:

  • Run submitted definitions on your data
  • Share CohortDiagnostics results
  • Execute KEEPER on your own data with manual review (no LLM)
  • Execute LLM‑enabled KEEPER locally

Your participation strengthens generalizability and enriches community learning.

:tada: Let’s Build the Future of Phenotyping

Phenotype Phebruary 2026 is our opportunity to prototype a scalable, systematic, and AI‑enhanced phenotype development workflow—together.

We invite all collaborators to submit definitions, iterate with us, and learn from this community-wide experiment.

More details, including the condition poll and submission instructions, will be shared shortly on the OHDSI Forums.

Let’s make Phenotype Phebruary 2026 our most innovative and impactful yet!

Warm regards,
The OHDSI Phenotype Workgroup

*Minds Meet Machines draft manuscript:

For those who participated in or contributed to Minds Meet Machines, or who just want to see the first draft, here are the drafted manuscript and related artifacts. Please feel free to edit it and add your information to the authorship table if you meet the authorship criteria.

Azza Shoaibi PhD
Gowtham Rao MD, PhD


Good evening everyone – Phenotype Phebruary is postponed until further notice :frowning:

We can't hold meetings, don't have your emails, and don't have access to the workgroup's MS tenant space :frowning:

Not to worry: this gives us more time to think through the problem space of phenotyping. I am going to finally prioritize getting the Minds Meet Machines paper done. If you are interested, we can use the forums to discuss any conceptual or theoretical ideas you may have.

I have been thinking hard about how a Proposer-Validator reasoning* framework using LLMs can help with phenotyping. Given a target clinical idea, can one LLM propose and another LLM validate a chain of optimization sequences performing recursive development and evaluation?

  • Note: I am not convinced that LLMs think - but I have accepted the word reasoning. :slight_smile:

I’m calling this concept the Agent Reasoning Framework (ARF) for Phenotype Development and Evaluation. Can LLMs perform systematic, reproducible phenotyping by working on the fundamental unit: an Event (the tuple {Person_ID, Event_Date})?

Here is how I envision this Proposer-Validator loop working across the phases:

  • Phase I: Maximizing Sensitivity
    The Proposer agent starts with an “Anchor” concept (e.g., a broad diagnosis code) and iteratively proposes expansion rules using Knowledge-Driven (RAG) and Data-Driven inferences. To validate these proposals, the Validator checks the “Marginal Set”—events captured only by the newly proposed rule—to determine if the individual events and group-level population characteristics are clinically similar to the base set. We apply a “Shock Expansion” constraint to ensure the phenotype doesn’t drift semantically (e.g., accidental expansion from a specific disease to generic pain).
  • Phase II: Maximizing Specificity
    The framework treats every inclusion rule as a hypothesis. The Proposer suggests a rule, and the Validator assesses it by comparing the “Retained Events” against “Removed Events.” A test checks if the removed group is systematically distinct (Good Attrition) or phenotypically identical (Bad Attrition/Sensitivity Error).
  • Evaluation & Adjudication
    Event-level checks would utilize a KEEPER (Knowledge-Enhanced Evaluation of Phenotype Evidence using Reasoning) approach—an LLM-as-a-Judge framework that reviews longitudinal patient narratives to verify concordance between the clinical intent and the computed phenotype, serving as a ground truth for the Agent’s logic. Group-level statistics, including high-dimensional covariate diagnostics, can be used to compare event groups.
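
The Phase I "Marginal Set" check can be sketched concretely. In the code below, the event representation and the similarity measure are toy stand-ins (in the framework described above, the Proposer and Validator would be LLMs judging clinical similarity, not a simple concept lookup):

```python
# Toy sketch of the Phase I Proposer-Validator check. Events are
# (person_id, event_date, concept) tuples; the similarity measure is
# a deliberately naive stand-in for an LLM Validator's judgment.

def marginal_set(base, expanded):
    """Events captured only by the newly proposed expansion rule."""
    return expanded - base

def clinically_similar_fraction(base, marginal):
    # Naive stand-in: an event is "similar" if its concept already
    # appears in the base set.
    base_concepts = {concept for (_, _, concept) in base}
    hits = sum(1 for (_, _, c) in marginal if c in base_concepts)
    return hits / len(marginal) if marginal else 0.0

def validate_expansion(base, expanded, threshold=0.5):
    # "Shock Expansion" guard: reject proposed rules whose marginal
    # events drift away from the base phenotype.
    return clinically_similar_fraction(base, marginal_set(base, expanded)) >= threshold

base = {(1, "2020-01-01", "MI"), (2, "2020-02-01", "MI")}
good = base | {(3, "2020-03-01", "MI")}            # similar marginal event
bad = base | {(4, "2020-04-01", "generic pain")}   # semantic drift
print(validate_expansion(base, good), validate_expansion(base, bad))  # True False
```

The Phase II check is the mirror image: compare "Retained Events" against "Removed Events" for each inclusion rule, accepting the rule only when the removed group is systematically distinct.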

Curious to hear your thoughts on this “Event-Level” approach.

Hello all, newcomer here :wave:

This looks like an interesting use case for an Agentic-RAG pipeline with a human-in-the-loop approach.

I’m thinking out loud here. My initial thought is an iterative approach, starting with multiple AI agents to handle specific subdomains. For example, three agents representing:

  • clinical presentation
  • biological markers
  • treatment responses

These agents could iteratively map similar patterns and pass them to human experts for initial validation. From my personal experience, involving human domain experts in the middle of the pipeline leads to better performance than placing them at the end.
After initial human verification, the selected data can then be passed back to the AI agents for further curation.

I also think Graph-RAG would be a good fit here. :slightly_smiling_face:


Re Phenotype Phebruary.

We still plan to accept phenotype submissions for KEEPER assessment across a network of data sources. If you want to submit your cohort definitions, please let the community know by posting on this thread. I will DM you.

Right now, we are thinking of a sub-type of Myocardial Infarction or Liver Injury. I have not written the target clinical idea/description yet.

If you are interested in submitting cohort definitions for network-based KEEPER evaluation, please write below.

Apologies, we won't be using our traditional ways of communicating, i.e., MS Teams or email, for the reasons above.

Gowtham and Azza

I agree with the iteration approach. But what do you think should be the optimization target for the iteration?

My thoughts are: we first optimize for maximum sensitivity without dropping specificity/PPV below a certain threshold; then, once sensitivity has been maximized, we optimize for specificity/PPV. Sensitivity is addressed in the entry-event portion of the cohort definition constructor (a tool like ATLAS), and specificity is addressed in the inclusion-rule section of the cohort constructor.
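
As a toy illustration of that two-phase target, suppose each candidate definition has been scored for sensitivity and PPV (the candidate names and numbers below are entirely made up):

```python
# Toy sketch of the two-phase optimization target: maximize sensitivity
# subject to a PPV floor, then maximize PPV among the most sensitive
# candidates. Candidates are hypothetical (name, sensitivity, ppv) tuples.

def two_phase_select(candidates, ppv_floor=0.5):
    # Phase 1: keep only candidates meeting the PPV floor, then find
    # the best achievable sensitivity (entry-event side).
    feasible = [c for c in candidates if c[2] >= ppv_floor]
    best_sens = max(c[1] for c in feasible)
    # Phase 2: among the most sensitive candidates, maximize PPV
    # (inclusion-rule side).
    top = [c for c in feasible if c[1] == best_sens]
    return max(top, key=lambda c: c[2])

candidates = [
    ("broad entry event", 0.95, 0.40),   # fails the PPV floor
    ("entry + lab rule",  0.90, 0.65),
    ("entry + lab + rx",  0.90, 0.80),
    ("strict inclusion",  0.60, 0.95),
]
print(two_phase_select(candidates))  # ('entry + lab + rx', 0.9, 0.8)
```

The design choice mirrors the cohort constructor's structure: the PPV floor keeps Phase 1 from degenerating into "accept everything," and Phase 2 only ever tightens within the sensitivity already won.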

@lasantha13 I personally would love to hear about your experience, and I am sure the entire community would too. Are you able to share it? I can set up a half-hour call for us to meet and discuss.

I think one possible optimization target for the iteration should be semantic clarity rather than performance. In the early iterations, the sub-domain agents (clinical, biological, treatment) can work independently to deduplicate and normalize concepts within their own space. Then their outputs can be combined and re-normalized across agents to reduce overlap and make the signals more distinct.

This process can keep iterating as new agents are added. All of this can happen before any sensitivity or specificity evaluation, with the idea being to stabilize the phenotype representation first, before moving on to performance optimization.

I do not have prior experience with phenotype development or with the OHDSI ecosystem specifically. However, I was involved in the development of an LLM-based data extraction pipeline for a CDC program competition hosted by DrivenData in late 2024, where our team placed second.

At that time, AI-agent–based workflows were not yet common, so we did not use a fully agentic architecture. Instead, the pipeline combined LLM-based pathways with topic modeling approaches. We observed that including domain experts in the middle of the pipeline, rather than only at the evaluation stage, significantly improved both extraction accuracy and overall data processing performance.

You can see more details here. I’m not sure if this is relevant to your work, but I’d be happy to discuss it if it is of interest.

That is a fascinating observation from your CDC challenge work. I agree that this is where our phenotyping tools need to go. You mentioned that keeping the expert “in the middle” was key—rather than just at the end for evaluation.

I’ve been thinking about this transition in terms of Intent-Based Programming, or perhaps more specifically for us, Intent-Based Phenotyping.

The analogy I like to use is Level 3 Autonomous Driving:

  • Current State (Level 0/1): We are “driving” manually. We search ATLAS for codes, we explicitly write the logic, and we handle every turn (inclusion rule) ourselves.
  • The Goal (Level 3): The human provides the intent (the destination), and the software does the driving (implementation). But—and this connects to your point—the human is still in the driver’s seat. We aren’t sleeping in the back; we are continuously monitoring the dashboard to ensure the car doesn’t drift.

In this future workflow, a user might declare a clinical intent: “Construct a cohort of New Onset T2DM patients initiated on Metformin, excluding secondary diabetes.”

The “agent” would then handle the implementation details—scanning the vocabulary, proposing concept sets, and writing the JSON/SQL. The human expert’s role shifts from authoring (picking codes one by one) to auditing and course-correcting. We would watch the “dashboard” (tools like CohortDiagnostics) and intervene only when the agent “drifts”—for example, if it accidentally grabs a code for ‘prediabetes’ or messes up a temporal constraint. The human in the loop can also monitor the impact of empirical assessment, in real-time, on sensitivity and/or specificity/PPV.
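
That intent-to-implementation handoff, with the human auditing for drift, might look like the following toy sketch. The vocabulary, drift check, and audit step are all hypothetical stand-ins; a real agent would query the OMOP vocabulary and emit ATLAS JSON:

```python
# Toy sketch of intent-based phenotyping with a human audit step.
# The vocabulary and all mappings are invented for illustration.

VOCABULARY = {
    "type 2 diabetes": ["T2DM dx", "T2DM with complication"],
    "prediabetes": ["prediabetes dx"],
    "metformin": ["metformin 500mg", "metformin ER"],
}

def propose_concept_sets(intent_terms):
    # Agent step: map each intent term to candidate vocabulary concepts.
    return {term: VOCABULARY.get(term, []) for term in intent_terms}

def audit(proposal, forbidden_terms):
    # Human/dashboard step: flag drift, e.g., 'prediabetes' sneaking
    # into a New Onset T2DM cohort, so the expert can course-correct.
    return [term for term in proposal if term in forbidden_terms]

# The agent over-expands the declared intent; the auditor catches it.
proposal = propose_concept_sets(["type 2 diabetes", "metformin", "prediabetes"])
flags = audit(proposal, forbidden_terms={"prediabetes"})
print(flags)  # ['prediabetes'] -> the human intervenes and removes it
```

The human never picks codes one by one here; they only inspect the flagged drift, which is exactly the authoring-to-auditing shift described above.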

We definitely need to move away from the idea that AI just “does the work”; we need to collaborate with it. We are active co-pilots.

Great to have you here for Phenotype Phebruary! Glad to hear of your interest in OHDSI and phenotyping.

I have been doing some experiments in this space: how can scientists and AI collaborate to create phenotypes that have the desired performance characteristics?

Yes, I totally agree. We need to be thoughtful about ‘where and when’ to use AI to get the maximum value from it, rather than completely depending on it. That’s where domain experts play a critical role in guiding and controlling the process.

I’m really interested in learning more about this and joining you all to do some interesting work in this space. Thanks!