Minds Meet Machines - OHDSI Symposium 2025 - Phenotype Development and Evaluation Work Group

[This message also serves as notice of the release of the concept sets for the challenge: GitHub - ohdsi-studies/MindMeetsMachines: The "Minds Meet Machines" Challenge, a concept set development study by the OHDSI Phenotype Development and Evaluation Workgroup. AI participants must submit their entries by October 8th, 2025, 6:00 PM EST; you can also email me directly.]

Dear OHDSI Colleagues and Registered Participants,

MS Teams OHDSI meeting - Online option

We are excited about the enthusiastic response to the “Minds Meet Machines” challenge, organized by the Phenotype Development and Evaluation Workgroup for the upcoming OHDSI Symposium on October 9th, 8:00 AM to 12:00 PM (walk-in registration welcome).

Content 2025 OHDSI Global Symposium - Phenotype workgroup

In our planning discussions, we have finalized the structure of this activity. It will be conducted as a hands-on workshop and Quality Improvement (QI) project aimed at evaluating and improving internal OHDSI methodologies for concept set development. (This project has been determined, in collaboration with Johns Hopkins University, not to constitute Human Subjects Research.)

To analyze human collaboration, identify best practices, and evaluate the workflows, we will be performing systematic data collection. This includes gathering expertise data for team balancing and audio/video recording of team deliberations.

IMPORTANT NOTE ON CONSENT: Participation is voluntary. However, because the recordings will be analyzed and shared publicly (e.g., on YouTube and the OHDSI website) to disseminate findings, all participants must complete an Informed Consent and Media Release form upon arrival.

The core objective: a collaborative QI activity comparing concept sets generated by human teams against those produced by autonomous AI pipelines.

This post details the logistics and the final run-of-show for the workshop.

Prerequisites for Participants (IMPORTANT)

To maximize our limited time together, please ensure the following:

  1. Bring Your Laptop: This is essential. Participants in the human workflow arm must bring their own laptops to access the workshop’s dedicated ATLAS environment.
  2. Review Materials (Homework): The target phenotypes and clinical descriptions are specified in ohdsi-studies/MindMeetsMachines. It is crucial that you review these materials before the workshop, as session time will be dedicated to building and analysis, not to reading background information.
  3. Vocabulary Standardization: To ensure consistency, all work (Human and AI) will utilize a standardized OMOP Vocabulary version released August 27th 2025.
  4. Informed Consent: You must review and sign the Informed Consent and Media Release form upon arrival to participate in the recorded sessions.

Workshop Agenda (October 9th)

The workshop runs from 8:00 AM to 12:00 PM.

(7:00 AM - 8:00 AM: Setup and Preparation - Workshop Leads/Volunteers Only)

8:00 AM - 8:30 AM: Welcome and Logistics

  • Introduction to the QI Project objectives and procedures.
  • Informed Consent Administration (At check-in and form submission).
  • Expertise Survey completion (Self-identification of Clinical and Informatics/Tooling proficiency).
  • Stratified Randomization (Team Assignment).

8:30 AM - 9:45 AM: Phase 1 - Human Concept Set Creation

The “Split and Reconcile” process begins. (Note: Deliberations will be audio- and video-recorded).

  1. Split: Teams split into subgroups.
  2. Independent Creation: Subgroups work independently in ATLAS. (The use of GenAI tools, such as ChatGPT, is strictly prohibited in this phase).
  3. Reconcile: Subgroups convene to negotiate one final concept set.
  • 9:45 AM: Pencils Down (ATLAS instance locked).

9:45 AM - 10:15 AM: AI Presentations

  • Coffee break for participants.
  • Short presentations from AI methodology leads (e.g., Darya, Joel).
  • (Backend: Technical team extracts human concept sets, merges with AI sets, blinds sources, and generates the adjudication lists (“The Delta”)).

10:15 AM - 11:30 AM: Phase 2 - Blinded Adjudication and Reflection

  1. Reshuffling: Participants must sit at a table working on a phenotype different from the one they worked on in Phase 1.
  2. Adjudication Process: Designated Clinical Experts lead the adjudication of “The Delta” (concepts where disagreement existed) to establish the True Gold Standard (TGS).
  3. Prioritization and Decision Making: The list is prioritized by maximal disagreement and highest Concept Prevalence (RecordCount). The Clinical Expert makes final decisions, aided by a neutral volunteer (the “Honest Broker”) to ensure the adjudicator remains strictly true to the original clinical description, preventing definition drift.
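
For readers curious how the adjudication list might be ordered, here is a minimal, illustrative sketch only (not the actual backend code; the function and column names such as source, concept_id, and record_count are hypothetical) of building “The Delta” and prioritizing it by maximal disagreement and then by concept prevalence:

```python
import pandas as pd

# Illustrative only: build "The Delta" (concepts the blinded sources disagree on)
# and order it for adjudication. The schema below is an assumption, not the
# actual backend design.
def build_adjudication_queue(submissions: pd.DataFrame) -> pd.DataFrame:
    """submissions: one row per (source, concept_id), with the concept's record_count."""
    n_sources = submissions["source"].nunique()

    per_concept = (
        submissions.groupby("concept_id")
        .agg(n_included=("source", "nunique"),
             record_count=("record_count", "max"))
        .reset_index()
    )

    # Concepts included by every source are agreed upon; the rest form "The Delta".
    delta = per_concept[per_concept["n_included"] < n_sources].copy()

    # Disagreement peaks when roughly half of the sources include the concept.
    delta["disagreement"] = delta["n_included"].apply(
        lambda k: min(k, n_sources - k) / n_sources
    )

    # Prioritize by maximal disagreement, then by highest concept prevalence (RecordCount).
    return delta.sort_values(["disagreement", "record_count"], ascending=[False, False])
```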

11:30 AM - 12:00 PM: Wrap-up and The Reveal

  • Adjudication stops.
  • Presentation and discussion of preliminary results (Performance will be evaluated by comparing concept sets against the TGS, primarily using the Prevalence-Weighted F1 Score).
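
As noted later in this thread, the evaluation methodology is still being finalized. For illustration only, one possible reading of a prevalence-weighted F1 score is to weight each concept by its record count when computing precision and recall against the TGS; a minimal sketch under that assumption:

```python
def prevalence_weighted_f1(submitted: set, gold: set, record_count: dict) -> float:
    """Hypothetical interpretation: precision/recall in which each concept counts
    with weight equal to its record count (prevalence) rather than a flat 1."""
    def weight(concepts):
        return sum(record_count.get(c, 0) for c in concepts)

    tp = weight(submitted & gold)                 # agreed concepts, weighted by prevalence
    w_submitted, w_gold = weight(submitted), weight(gold)
    precision = tp / w_submitted if w_submitted else 0.0
    recall = tp / w_gold if w_gold else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
```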

We look forward to a productive and insightful workshop. Thank you for your participation!

Sincerely,

Gowtham A Rao, Azza A Shoaibi, and the Minds Meet Machines Team

  1. Brian Toy, USC
  2. Uveitis (operationalize SUN classification criteria for sarcoid uveitis, syphilis uveitis, VKH, HSV anterior uveitis), retinal vasculitis

Brian, so good to have you lead a group again. I updated the first post with your information.

I am interested in critical care, especially sepsis. I have ongoing research on phenotyping and examining hospital outcomes.

Open Challenge to AI Innovators: Participate in the OHDSI 2025 “Collaborative Intelligence” Scientific Evaluation

Dear OHDSI Community and Innovators in Clinical Informatics,

Following up on the initial ‘Call for Collaboration’ and the detailed experimental design (the “Run-of-Show”) posted previously, we are now actively recruiting for the “AI Arm” of this experiment.

The proposed OHDSI 2025 Symposium session, "Collaborative Intelligence: Humans and AI in Concept Set Development," is a "Minds Meet Machines" challenge designed as a rigorous, hypothesis-driven scientific study. The scientific rigor of this evaluation depends on comparing expert human curation against the most advanced, distinct (multiple) AI pipelines available.

To ensure this is a world-class, scientific evaluation of Humans versus AI in clinical modeling, we must actively seek out and challenge the leaders in this space.

We have conducted an extensive search spanning academic literature (PubMed, preprints, informatics journals), conference proceedings, open-source repositories (GitHub), and proprietary announcements. Our objective was to identify the academic teams, open-source projects, and industry leaders who are actively developing methodologies for AI-assisted concept set development, value set management, and clinical coding (including OMOP, ICD, HPO, and others).

The organizations listed below are the innovators identified in our systematic search. If you have not been identified, please reach out to us now and you will be invited.

If your organization is listed, it is because you have publicly claimed innovation in this domain. To reach you, we are posting this here and emailing you directly.

We invite you to bring your solutions and expertise to this open collaboration. We challenge you to validate your methodologies within our standardized, blinded, randomized comparative framework using the ~28 defined phenotype conditions detailed in the experimental protocol (see the first post in this thread).

To the Innovators Listed: Are you up to the challenge?

To the OHDSI Community: The success of this collaboration relies on bringing these innovations together. If you collaborate with these teams or individuals, we encourage you to facilitate an introduction or forward this invitation.

Below is the comprehensive list of identified innovators and their publicly available contact information, compiled entirely through public domain searches.

OHDSI 2025 “AI Arm” Challenge: Outreach Tracker

Name Key Person(s) / Public Contact Why You (Focus/Innovation) Link(s) Response
Academic & Research Initiatives
Columbia University DBMI Chunhua Weng, PhD (chunhua@columbia.edu ); George Hripcsak, MD; Anna Ostropolets, MD, PhD anna.ostropolets@columbia.edu Criteria2Query (C2Q) 3.0 (NLP to OMOP SQL); LLM for concept set curation (refining PHOEBE); Automated taxonomy learning. Columbia DBMI
Vanderbilt University Medical Center (VUMC) DBMI Wei-Qi Wei, MD, PhD: wei-qi.wei@vumc.org ; Chao Yan, PhD: chao.yan.1@vumc.org Evaluation of LLMs (GPT-4, Claude) for generating executable phenotyping algorithms (SQL queries) adhering to a CDM. Vanderbilt DBMI; Large Language Models Facilitate the Generation of Electronic Health Record Phenotyping Algorithms
Stanford University (AIMI / Shah Lab) Nigam H. Shah, MBBS, PhD (nigam@stanford.edu ) Extensive research on LLM evaluation in healthcare, including automated billing code assignment and AI fairness frameworks. The Stanford Center for Artificial Intelligence in Medicine and Imaging (AIMI)
UTHealth Houston & Mayo Clinic (Agentic MCP) Hua Xu, PhD: Hua.Xu@uth.tmc.edu ; Hongfang Liu, PhD: Hongfang.Liu@uth.tmc.edu Development of the Agentic Model Context Protocol (MCP) framework for zero-training, hallucination-preventive OMOP mapping using LLMs. An Agentic Model Context Protocol Framework for Medical Concept Standardization
King’s College London (KCL) / Alan Turing Institute (MedCAT Team) Prof. Richard Dobson: richard.j.dobson@kcl.ac.uk Developers of MedCAT/CogStack; NLP for extracting/linking clinical concepts to UMLS/SNOMED CT; Integrating Transformer models. Medical Concept Annotation Tool
UCSF - Bakar Computational Health Sciences Institute Vivek Rudrapatna, MD, PhD: Vivek.Rudrapatna@ucsf.edu Utilizing LLMs and NLP to transform unstructured narratives into structured data, including OMOP mapping and knowledge graphs (SPOKE). UCSF Bakar Computational Health Sciences Institute
Mass General Brigham (MGB) Research Team Heekyong Park, PhD (Corresponding Author): hpark25@mgb.org Utilizing LLMs + RAG to improve phenotyping accuracy from unstructured EHR data, comparing against traditional ICD code methods. A Comprehensive Evaluation of LLM Phenotyping Using Retrieval-Augmented Generation (RAG): Insights for RAG Optimization
Mount Sinai Health System (MSHS) Eyal Klang, MD (Corresponding Author): eyal.klang@mountsinai.org Evaluation of RAG-enhanced LLMs (GPT-4, Llama-3.1) for automating Emergency Department ICD-10-CM coding vs. human coders. Assessing Retrieval-Augmented Large Language Model Performance in Emergency Department ICD-10-CM Coding Compared to Human Coders
Yale School of Medicine rohan.khera@yale.edu ; Samah Fodeh, PhD (Corresponding Author): samah.fodeh@yale.edu Novel Sentence Transformer-based NLP approaches for mapping EHR data (medications) to the OMOP CDM. A Novel Sentence Transformer-based Natural Language Processing Approach for Schema Mapping of Electronic Health Records to the OMOP Common Data Model
CLH (Code Like Humans) Univ. of Cambridge/Microsoft Research Authors andreas@motzfeldt.dk ; Yu-Neng Chuang (Corresponding Author): yc577@cam.ac.uk LLM-based agentic framework designed to automate medical coding (ICD-10) by mirroring human processes. Code Like Humans: A Multi-Agent Solution for Medical Coding
MedCodER Univ. of Illinois Urbana-Champaign Authors sanmitra1@gmail.com ; Y. Zhang & J. Gao (Corresponding Authors): {yuz9, jgao8}@illinois.edu Generative AI framework for automatic medical coding (ICD code prediction) using extraction, retrieval, and re-ranking. MedCodER: A Generative AI Assistant for Medical Coding
Erasmus MC - Medical Informatics Peter Rijnbeek, PhD (p.rijnbeek@erasmusmc.nl) Exploring NLP and LLMs for extracting concepts from clinical text (including non-English languages) for phenotyping. Erasmus MC - Medical Informatics
The Jackson Laboratory & Charité Berlin Peter N. Robinson, MD, MSc (Peter.Robinson@jax.org) Development of Human Phenotype Ontology (HPO); Computational deep phenotyping and ontology-based data models (Phenopackets). The Jackson Laboratory & Charité Berlin
KOMAP University of Florida and NVIDIA Authors tcai@hsph.harvard.edu Knowledge-Driven Online Multimodal Automated Phenotyping; Uses knowledge graph embeddings on EHR concepts to generate feature lists. Knowledge-Driven Online Multimodal Automated Phenotyping System
Open-Source Tools & Consortia
The Monarch Initiative Melissa Haendel, PhD; Chris Mungall, PhD info@monarchinitiative.org International consortium for deep phenotyping; NLP/ML to extract phenotypic information and standardize using ontologies (HPO). Monarch Initiative
Llettuce University of Manchester Authors g.figueredo@nottingham.ac.uk ; R.A.C. and G.D. (Corresponding Authors): {rebecca.croft, goran.davidovic}@manchester.ac.uk Open-source tool using local LLMs, semantic search, and fuzzy matching for converting medical terms into the OMOP standard vocabulary. Llettuce: An Open Source Natural Language Processing Tool for the Translation of Medical Terms into Uniform Clinical Encoding
RAG-HPO Baylor/Texas Children’s Hospital Authors Jennifer.Posey@bcm.edu Python tool using RAG to enhance LLM accuracy in assigning Human Phenotype Ontology (HPO) terms from medical free-text. Improving Automated Deep Phenotyping Through Large Language Models Using Retrieval Augmented Generation
GCAF (Generalised Codelist Automation Framework) A. Aslam (Corresponding Author): a.aslam@sheffield.ac.uk Framework and Python repository for automating the development of clinical codelists (Readcodes, SNOMED). An automation framework for clinical codelist development validated with UK data from patients with multiple long-term conditions
CLAMP (Clinical Language Annotation, Modeling, and Processing) Hua Xu, PhD (Hua.Xu@uth.tmc.edu ) Comprehensive clinical NLP toolkit; Tools for entity recognition and mapping clinical concepts to UMLS/SNOMED CT. CLAMP Clinical Language Annotation, Modeling, and Processing Toolkit
OHNLP/omop_mcp jaerongahn@gmail.com Open Health NLP; Open-source Model Context Protocol (MCP) server for mapping clinical terminology to OMOP concepts using LLMs. GitHub omop_mcp
pyomop bell@nuchange.ca Python library for OMOP CDM; Includes LLM-based natural language queries and LLM-agent assisted FHIR to OMOP mapping. GitHub pyomop ; Vibe Coding FHIR to OMOP - Bell Eapen MD, PhD.
MedRAG / MIRAGE Benchmark Univ. at Buffalo, NCBI/NLM/NIH Authors Toolkit (MedRAG) and benchmark (MIRAGE) for the systematic evaluation of RAG systems in the medical domain. Benchmarking Retrieval-Augmented Generation for Medicine
OHDSI Community Efforts
OHDSI Generative AI Workgroup OHDSI Community Members Official workgroup dedicated to the application of generative AI for RWE, including automated taxonomy learning and concept sets. Large Language Models Can Enhance OHDSI Evidence Generation Mission
CHIMERA OHDSI Contributors Initiative focused on automatic concept set creation and mapping to standard OMOP codes within the OHDSI ATLAS platform. CHIMERA: Automatic Concept Set Creation and Mapping to Standard OMOP Codes in ATLAS
LLM-Powered OMOP Conversion Tool Artem Naumenko (OHDSI: ents) daniel@hyperunison.com Community initiative utilizing an LLM-first approach for automated structural and semantic mapping to accelerate OMOP conversion. Accelerating OMOP Conversion with free LLM-Powered Tool – Looking for Test Users
Proprietary Software & Industry
RWE & Data Harmonization Platforms
IQVIA Contact Page AI solutions to accelerate OMOP conversions and AI-driven analytics for RWE generation. IQVIA HealthGrade AI
Evidentli info@evidentli.com Piano and Auto-Mapper tools; Uses ML/NLP to automate the standardization of medical concepts for OHDSI CDM transformation. Evidentli
OM1 info@om1.com PhenOM AI Platform; AI-powered digital phenotyping to create digital phenotypic “fingerprints” for cohort identification. OM1
Nference Google Search nferX AI platform; Advanced NLP to normalize data and map concepts to standardized ontologies for RWE generation. nference
Truveta Google Search “Truveta Language Model” (TLM) to clean, harmonize, and normalize disparate EHR data, including extensive concept mapping. Truveta
Terminology & NLP Providers
IMO Health CustomerSupport@imohealth.com LLMs for automated medical coding (ICD-10, CPT, SNOMED CT) and “smarter value set management,” enhanced with RAG and proprietary terminology. IMO Health
John Snow Labs info@johnsnowlabs.com Healthcare NLP & LLM library; Healthcare-specific LLMs to extract entities and map them to standardized terminologies (RxNorm, ICD, SNOMED, HPO). John Snow Labs
Autonomous & AI-Assisted Coding
Solventum (Formerly 3M HIS) Helps.us@solventum.com 360 Encompass™ system; AI-powered autonomous and computer-assisted coding solutions. Solventum
Fathom Health hi@fathomhealth.com Deep learning and LLMs to automate medical coding and auditing (E/M, CPT/HCPCS, ICD). Fathom Health
BUDDI.AI Google Search CODING.AI; Deep learning platform to automate medical coding (CPT, ICD 10). BUDDI.AI
Nym Health Contact Nym | Trusted Medical Coding & Billing Software Autonomous medical coding using “Clinical Language Understanding (CLU)” to translate notes into billing codes (ICD-10, CPT). Nym Health
CodaMetrix hello@codametrix.com AI-Powered Contextual Coding Automation Platform; Uses NLP and a “knowledge graph” to improve coding quality. CodaMetrix
Maverick Medical AI info@maverick-ai.com Autonomous medical coding platform powered by deep learning, aiming for high “Direct-to-Bill” rate. Maverick Medical AI
MediCodio info@medicodio.ai AI-powered coding automation; Hybrid strategy using LLMs/AI with SNOMED CT for clinical interpretation, then converting to ICD-10. Medicodio
AGS Health https://www.agshealth.com/contact-us/ Intelligent Authorization® Autonomous Coding; NLP/ML to automate ICD-10-CM, PCS, CPT, and E&M code assignment. AGS Health
AutoICD info@autoicdapi.com AI-powered Clinical Coding API; NLP to process unstructured texts and generate structured data in SNOMED-CT and ICD10. AutoICD
MediMobile (Genesis) support@MediMobile.com Autonomous medical coding powered by AI (Genesis); Automates CPT & ICD-10 coding in real-time from documentation. MediMobile
Ambient AI & Documentation
Abridge Connect with AI Healthcare Experts | Contact Abridge Ambient AI for medical conversations; Real-time mapping of spoken clinical concepts to ICD-10, SNOMED, CPT. Abridge
Suki AI support@suki.ai AI-powered voice solutions; Automatically generates suggested ICD-10 and CPT codes based on captured clinical context. Suki AI
Prometheus sarah_seager@epam.com Prometheus is a powerful and versatile platform that can enable collaborative research and data analytics in the healthcare industry. Its comprehensive features, compatibility with industry standards, and focus on security and collaboration make it a valuable tool for researchers and healthcare organizations seeking to leverage the power of data to improve patient outcomes. https://solutionshub.epam.com/solution/prometheus

Call to Action for the AI Arm

This experiment is intended to generate high-quality, peer-reviewable evidence regarding the integration of AI into the phenotyping lifecycle.

If your organization is interested in submitting your AI pipeline for this evaluation, please REPLY to this forum thread or contact me directly rao@ohdsi.org .

As noted in the experimental stipulations:

“To ensure reliability and fairness, the AI-generated concept sets have to be generated prior to the symposium using the exact same standardized inputs provided to the human teams.”

“AI Methodology Transparency: The AI prompt and system settings (e.g., agent, version, temperature) used for the pre-generated AI concept sets will be documented.” However, you do not have to reveal proprietary secrets, and this is not an opportunity to market yourself.

We will coordinate with participating teams on the next steps for this crucial pre-symposium generation phase.

Sincerely,

Gowtham Rao

On behalf of the Phenotype Development and Evaluation Workgroup Organizing Leads & Contributors (Gowtham A Rao MD, PhD; Azza A Shoaibi PhD; Joel Swerdel PhD; Jack Murphy MPH, PhD)

This is SUCH a cool idea @Gowtham_Rao ! Question - why are the AI results generated ahead of time? Wouldn’t it be better if the generation of AI results were controlled in the same way as the manual results, with a time limit and recording of screen/audio? Otherwise it seems the AI developers would have an unfair advantage, being able to review the results and refine their tool as much as they want prior to the event. Even though the input prompt has to be standardized, the tool developer in theory could refine system prompts, context, and other functionality to improve the output.

I suppose this is a risk either way, as the AI developers could still prepare ahead of time
 but at least having their activity time-bound and recorded at the event would afford a bit more control. The only way to prevent this sort of preparation (which is a risk for the “manual contributors” too) would be for the conditions of interest to be revealed at the time of the event. I can see why you might want the conditions to be known in order to recruit the right clinical experts, but perhaps just revealing the higher level therapeutic areas would be sufficient? Or maybe there could be a “wildcard” condition that’s only revealed day-of?

Hi @katy-sadowski - this is great feedback, and exactly why we are having this discussion now while we still have time to refine our experimental design. Your questions about the timing and control of the AI generation are critical.

I am thinking maybe we create a final protocol based on all this discussion. For now, let me try to clarify my thinking on these points, and you can provide feedback to help improve it.

1. The AI Arm: Decoupling Algorithm from Infrastructure and the Impracticality of Live Execution

The primary objective of this evaluation is strictly to assess the algorithmic efficacy of the AI solutions in terms of quality/validity compared to the human-generated version (evaluation scale being developed by @Jack_Murphy and evaluation tool being developed by @gdravida09 ).

We must isolate the quality of the output from confounding variables related to the execution environment. As you noted, the potential AI solutions range from academic prototypes (potentially constrained by API limits or budgets) to mature commercial platforms on hyperscaler infrastructure.

This variability is precisely why we cannot enforce a live execution with a strict time limit (e.g., 1 hour) for the AI arm, as we do for the human arm. Measuring speed or cost in this context largely reflects available resources rather than the AI’s inherent capability to interpret clinical descriptions. If a graduate student has a creative way of using LLMs but, lacking a hyperscaler cloud environment, their laptop takes four days to generate the concept set, that’s okay; we won’t judge that. Therefore, we must allow the AI arms to execute in their native environments ahead of the symposium.

2. The AI Arm: Ensuring Control, Autonomy, and Eliminating Human Intervention

While we cannot control the infrastructure, we absolutely must control the process to ensure fairness, which addresses your concern about developers having unlimited time to review and refine the output.

I strongly agree that we must rigorously exclude “human-in-the-loop” (HITL) intervention during the execution phase. If a human modifies the AI-generated concept set, we are measuring the human’s skill in phenotyping or picking codes, not the AI’s capability.

To address this, we could implement the control mechanism you suggested: recording the process. We will require a continuous, unedited, screen-recorded video of the end-to-end process for the AI arm. This video must verify that the execution was a single, autonomous run and that human interaction was strictly limited to:

  1. Inputting the standardized clinical description text.
  2. Initiating the process.

The subsequent concept set generation must be entirely autonomous, with no iterative tweaking, correction, prompt refinement, or reprocessing by the operator during that run (they can, of course, re-run as many times as they want; they just cannot pick codes, i.e., the output must be autonomous). This video evidence can be audited by an honest broker, and the AI submitter will attest to the integrity of the run.

3. The Human Arm and Equitable Protocols

To ensure a methodologically sound comparison, we have to establish clear execution protocols for both arms.

3.1 Timing and Preparation

The input (the clinical description) will be released simultaneously to both the AI and human teams, about 7 to 10 days prior to October 9th.

We considered the “wildcard” idea of revealing the condition on the day of the event to completely prevent preparation. However, as you mentioned, revealing the area in advance aids in recruiting the right clinical experts and allows for necessary logistical preparation.

It is true that both sides can prepare during this 7-10 day window. Human teams can research the topic and familiarize themselves with the terminology. AI teams can, in theory, refine their system prompts or functionality. We view this preparation as equitable: humans utilize their expertise and research (analogous to pre-training), and AI utilizes its models. The key is that the final execution phase is rigorously controlled for both arms (the 1-hour live session for humans, and the single autonomous run video for AI).

3.2 Execution Environment and Integrity

  • Environment and Tools: Human teams must execute the task on-site within the allocated 1-hour timeframe, utilizing the ATLAS platform that will be specially deployed for this experiment. All concept set and expression development will be done by the team on their own laptops, with the clinical lead sitting with them. The ATLAS platform will not contain the AI-generated concept sets, of course, as those will have been pre-submitted by the AI arm and held by the backend honest-broker team. The use of LLMs or automated generation tools is strictly prohibited for the human teams.
  • De Novo Development and Proctoring: It is critical that concept sets are developed de novo (anew) during the session. Clinical leads will actively proctor the evaluation to ensure participants do not import pre-existing or “canned” concept expressions from prior work. Humans can, however, think about the task and practice during the 7-to-10-day preparation period.
  • Process Transparency: To facilitate a comprehensive analysis of the human workflow, collaborative dynamics, and decision-making process, the conversations and interactions within the human teams will be recorded. The proctor’s role is to ensure the development process is organic to the session and that pre-existing assets are not being imported.
  • Scientific Integrity: This collaborative study relies fundamentally on the integrity of its participants. We trust that all collaborators share the OHDSI community’s commitment to rigorous, transparent science and will adhere to the established protocols in good faith, ensuring an honest evaluation.

We are doing our best to establish these clear parameters :smiley: and hope there are no cheaters. This will create a rigorous and equitable evaluation focused on the core challenge: the intellectual and methodological ability to translate a clinical description into high-quality concept sets. I welcome further feedback on these guidelines.

Joking here: maybe we are already more rigorous than some RCTs.

Great Idea for a study @Gowtham_Rao @Azza_Shoaibi and others.

I for one am tired of hearing about how AI will take my Job.

Let’s settle this.

One question. If I’ve read the schedule right, we are doing all (3-4) concept sets in 55 min and running a Delphi process for each one?

[I’m not scared. You’re scared. I’m just asking]

Thanks @Gowtham_Rao for the additional context!

It makes sense that we don’t want compute resources / API access to limit who can participate. And while the amount of time spent can have a huge impact on validity/completeness of a human-curated set, for obvious reasons we don’t want to leave that open-ended. So we’re really comparing “what an AI tool can get done without a time limit” to “what a group of experts can get done in 1 hour.”

And then we layer on the 7-10 days of preparation, which is where I’m getting a bit more concerned because there is no control over what happens during that time. Perhaps teams should be required to report exactly what they did during that time period to prepare? (Hopefully folks will be too busy with the pre-Symposium rush to do much of anything
 :rofl:).

Finally - it’s good to hear the AI tasks will be required to be a single (hopefully recorded) uninterrupted session without human-in-the-loop. And I’m assuming the tools will be required to share their system prompts / context / etc. so we can be sure they’re operating in a capacity agnostic to the specific indications being studied and the formatting of the clinical description input.

I still do like the idea of a wildcard phenotype. I don’t think it’d be that crazy to choose an indication within a therapeutic area where we’ve already got a team of experts assembled. Any AI tool that can run in an hour or less (I’d actually guess this would apply to most of them?) could choose to participate.

It sounds like the expectation is that the human teams (aka Real Intelligence, RI) will make it through 1 concept set / Delphi round.

Not that I was scared. AI was scared.

If we’re only doing one, I’d also be game for a wild card phenotype within a domain, with reading material provided that morning - @katy-sadowski 's suggestion.

Experts could be given a list of 20 candidates to validate that any one of the 20 would be appropriate for their table, with no advance prep material; the target concept set would be named at the start and the reference material provided then.

Unlike AI, RI can handle that in 1 hour. Realistically, you wouldn’t want to spend more than that on research + creation of an individual concept set anyway, so it would remain a meaningful comparison.

If we think that remains unfair given the prep material is available to the AI teams in advance, another option would be to ‘split the difference’: distribute the materials for the N=[3 or 4?] possible concept sets for each table to its clinical lead, but only announce which one of those is the chosen one the day of? That way, at least the leads have to spread any preparatory effort out and are less likely to hyper-focus on one where they think they have an edge on AI.

Frankly, given my own schedule in advance of the symposium, I’d be happy with the first option and just roll with it on the day of.

One other question: who is going to be assigned the French Fry Salad concept set?

Dear Innovators and Colleagues,

We are following up on our invitation (below) and the ongoing discussion on the OHDSI Forum regarding the “AI Arm” of the OHDSI 2025 Symposium session, “Collaborative Intelligence: Humans and AI in Concept Set Development.”

We remain committed to ensuring this “Minds Meet Machines” challenge is a rigorous, hypothesis-driven scientific study. The validity of this evaluation depends entirely on comparing expert human curation against multiple, distinct, state-of-the-art AI pipelines.

To date, commitments to the AI arm have been limited. We understand the hesitation. Participating in a rigorous, head-to-head comparison can be sensitive; innovators may have valid concerns about protecting intellectual property (IP) or may be reluctant to have their solutions publicly ranked against comparators.

PLEASE NOTE – you do not need to identify yourself and you do not need to come in person!!! This event is FREE! See new option below.

Our Goal is Scientific Advancement, Not Commercial Ranking

Our primary objective is to generate high-quality, peer-reviewable evidence regarding the integration of AI into the phenotyping lifecycle. We want to foster collaboration and understand the current landscape of AI capabilities, not create a competitive leaderboard.

New Option: De-Identified (Blinded) Participation

To encourage broader participation and address the concerns mentioned above, we are formally introducing an option for de-identified (blinded) participation.

If you wish to protect your IP, or if you prefer not to be publicly ranked, you are welcome to participate anonymously.

If you choose this option:

  1. Your pipeline’s results will be fully included in the blinded adjudication phase of the experiment.
  2. During the final “Reveal” at the Symposium and in any subsequent publications, your results will be presented anonymously (e.g., “AI Pipeline A,” “AI Pipeline B”) rather than attributed to your organization.
  3. Crucially, you will still receive your organization’s specific, individualized performance results privately.

An Opportunity for Rigorous, Confidential Evaluation

We strongly encourage you to view this as a unique opportunity for self-evaluation and validation. This de-identified approach allows you to:

  • Objectively Benchmark: Understand exactly where your methodology stands in relation to human experts and other AI approaches using standardized inputs across the 28 defined phenotype conditions.
  • Evaluate Confidentially: Gain invaluable performance insights without the pressure of public ranking, protecting your competitive positioning while contributing to scientific evidence.

Whether you participate openly or de-identified, this challenge offers an unparalleled chance to rigorously evaluate your solution within a standardized, randomized, and blinded framework.

The success of this landmark evaluation relies on bringing these innovations together.

Next Steps

We need to finalize the participating AI teams soon to coordinate the crucial pre-symposium generation phase. If you are ready to participate—either openly or de-identified—please reply to this email or contact me directly (rao@ohdsi.org) to confirm your preference.

Thank you for considering this vital collaboration.

Sincerely,

Gowtham Rao MD, PhD

On behalf of the Phenotype Development and Evaluation Workgroup Organizing Leads & Contributors (Gowtham A Rao MD, PhD; Azza A Shoaibi PhD; Joel Swerdel PhD; Jack Murphy MPH, PhD)


This message is also being posted as a continuation of the OHDSI Forum thread here: [[CONFIRMED - HYBRID and IN PERSON] OHDSI Symposium 2025 - Phenotype Development and Evaluation Work Group - Call for Collaboration: Collaborative Intelligence – Humans and AI in Concept Set Development in Phenotyping]

Planning discussion
The YouTube video is available on the Phenotype Development and Evaluation Workgroup YouTube channel here.

This is a detailed summary of the OHDSI Phenotype Development and Evaluation Work Group meeting held on September 26, 2025. The meeting focused on the urgent planning and logistical coordination for the “Minds Meet Machines” challenge, scheduled for the OHDSI Global Symposium on October 9, 2025.

Overview: The “Minds Meet Machines” Challenge

The workgroup is organizing a four-hour, live scientific experiment designed to rigorously compare the performance of AI (specifically Large Language Models - LLMs) against expert human curation in developing phenotype concept sets.

The experiment utilizes a multi-arm, blinded, randomized, controlled design. Human teams will generate concept sets live, while AI-generated concept sets will be produced prior to the symposium using the same standardized inputs. The outputs will then be blindly adjudicated.

Gowtham Rao opened the meeting by stating that planning is “close to being in the red area,” emphasizing the complexity of the undertaking and the need to accelerate execution with the symposium less than two weeks away.

Key Discussion Points, Decisions, and Risks

1. Experimental Design and Logistics

Structure: The event will feature several tables (estimated at four, based on confirmed clinical leads), each focusing on one specific clinical idea (e.g., Systemic Lupus Erythematosus led by Dr. @Christopher_Mecoli ). There are approximately 48 in-person registrants, yielding an estimated 30-35 active participants.

Technology: A dedicated, stable ATLAS instance will be deployed. It will be active during the one-hour human generation phase and immediately locked (made read-only) afterward.

  • Decision: The ATLAS instance will use the latest version with PHOEBE and concept prevalence.
  • Action Item: The specific vocabulary version must be confirmed (with Patrick Ryan and Konstantin) to ensure consistency for both human and AI arms.

Participant Assignment: To ensure fairness and balance expertise, stratified randomization will be used.

  • Decision: Participants will be surveyed a week prior to rank their confidence and skills in informatics/tooling, allowing organizers to ensure a blend of senior and junior expertise at each table.
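
For illustration only (the actual assignment procedure is up to the organizers; the strata and field names below are hypothetical), stratified randomization could look roughly like this: shuffle within each self-reported expertise stratum, then deal each stratum round-robin across the tables so every table gets a similar mix of senior and junior expertise.

```python
import random
from collections import defaultdict

def assign_tables(participants, n_tables, seed=2025):
    """participants: list of dicts like {"name": "...", "stratum": "clinical-senior"}.
    Shuffle within each self-reported expertise stratum, then deal each stratum
    round-robin across tables so every table gets a similar mix."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for p in participants:
        by_stratum[p["stratum"]].append(p)

    tables = {t: [] for t in range(n_tables)}
    slot = 0
    for stratum in sorted(by_stratum):
        group = list(by_stratum[stratum])
        rng.shuffle(group)
        for p in group:
            tables[slot % n_tables].append(p["name"])
            slot += 1
    return tables
```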

Data Capture: Human deliberations during both generation and adjudication will be audio-recorded (with consent) for qualitative analysis of decision-making processes.

2. AI Participation (Critical Risk)

A major concern is the low number of confirmed AI participants, which threatens the statistical power of the study.

  • Status: Only two AI workflows are confirmed: J&J (Joel Swerdel’s script) and EPAM (Prometheus, led by Darya Zhukova).
  • Mitigation: Gowtham Rao has conducted extensive outreach to academic and industry innovators. To address hesitation regarding public comparison and potential negative exposure, the option for blinded (anonymized) participation has been formally introduced, allowing teams to benchmark their tools confidentially.

3. Standardization and Inputs

Standardized Clinical Descriptions: These inputs must be finalized one week before the symposium.

  • Decision: Azza Shoaibi will propose a standardized input format, which must include the intended utility of the phenotype. These inputs must be signed off by the clinical leads in advance to ensure stability during the live session.
  • Action Item: A dedicated training/alignment call will be organized for Clinical Leads.

AI Generation Timing: AI concept sets will be generated before the event. This ensures fairness by decoupling algorithmic quality from computational infrastructure limitations (e.g., academic vs. industry resources).

4. Adjudication and Evaluation Methodology (Major Discussion)

The most debated topic was how to define the “Gold Standard” and measure performance.

Blinding and Bias: Adjudication uses a crossover design (participants evaluate sets they did not create). However, a potential bias was identified: Clinical Leads, having overseen the human generation, will inherently recognize that set during adjudication.

  • Mitigation: Clinical Leads must be strictly instructed to act only as neutral subject matter experts during adjudication and not advocate for any specific codelist.

The “Gold Standard” Debate: The group debated the validity of using a pre-defined “Target Standard” (created in advance by organizers Gowtham, Azza, and Patrick) as the ultimate benchmark, as this might bias results toward those experts’ views.

  • Jack Murphy’s Proposal (Target Standard as Benchmark): Jack proposed a rubric combining:

    1. F1 Score (Precision/Recall): Automatically calculated against the pre-defined Target Standard.
    2. Structural Proficiency Score: Human-adjudicated score evaluating the efficiency and parsimony of the concept set expression (e.g., appropriate use of hierarchies).
  • Azza Shoaibi’s Counter-Proposal (Adjudicated Gold Standard): Azza proposed a more rigorous approach where the Gold Standard is derived from the experiment. All concepts generated by Humans, AI, and the Target Standard would be pooled. Concepts agreed upon by all are accepted; all discrepancies (deltas) are adjudicated. The final, consensus-driven list becomes the “True Gold Standard,” against which all initial attempts are then measured.

  • Decision: The evaluation methodology remains unconfirmed. A dedicated meeting is required to finalize the approach.

Vocabulary Handling: The group discussed how to compare concept sets if AI tools submit non-standard vocabularies (e.g., ICD-10 only).

  • Decision: The evaluation tool (developed by Gaurav Dravida) will facilitate like-for-like comparisons within the submitted vocabulary (e.g., comparing AI ICD-10 codes only against Human ICD-10 codes).
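
For illustration only (the evaluation tool’s actual logic is being developed by Gaurav Dravida; the field and variable names below are hypothetical), a like-for-like comparison could restrict both submissions to the same vocabulary before scoring:

```python
def restrict_to_vocabulary(concepts, vocabulary_id):
    """Keep only concepts from a single vocabulary (e.g., 'ICD10CM') so that a
    submission limited to that vocabulary is compared only against the same
    slice of the other arm's set. Field names are assumptions."""
    return {c["concept_id"] for c in concepts if c["vocabulary_id"] == vocabulary_id}

# Hypothetical usage: compare an ICD-10-only AI submission against the ICD-10
# subset of the human concept set (variable names are placeholders).
# ai_icd10 = restrict_to_vocabulary(ai_concepts, "ICD10CM")
# human_icd10 = restrict_to_vocabulary(human_concepts, "ICD10CM")
```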

Run of Show (October 9th)

The four-hour session is tightly scheduled, detailed in the OHDSI forum post:

  • 8:00-8:45 AM: Introduction and Standardization Training (including critical ATLAS naming conventions).
  • 8:45-9:45 AM: Phase 1: Human Baseline Generation (Live).
  • 9:45-10:30 AM: Transition & AI Overview (“The Critical Window”). While AI leads present their methodologies, the backend team executes the high-risk task of extracting, blinding, and loading the human-generated data into the adjudication tool.
  • 10:30-11:45 AM: Phase 2: Blinded Comparative Adjudication (Crossover design).
  • 11:45 AM-12:00 PM: The Reveal (Unblinding and preliminary results).

Action Items

  1. Finalize Evaluation Methodology: Azza to schedule a dedicated meeting (including Jack Murphy, Martijn, and Patrick Ryan) to finalize the Gold Standard definition and adjudication process.
  2. Standardized Inputs: Azza to circulate the template for standardized clinical descriptions.
  3. Clinical Lead Coordination: Gowtham to organize the training/alignment call with Clinical Leads and confirm any outstanding leads.
  4. Vocabulary Version: Gowtham/Azza to confirm the specific vocabulary version for the ATLAS instance with Patrick Ryan.
  5. Participant Stratification: Organizers to survey participants on their confidence/skills for stratified randomization.
  6. AI Recruitment: Continue aggressive outreach for AI participants, emphasizing the anonymous option.
  7. Technical Development: Gaurav and the backend team (Sajan) to continue developing the adjudication tool and data pipeline, adapting to the finalized evaluation methodology.
  8. Target Standard Creation: Gowtham, Azza, and Patrick to create the “Target Standard” concept sets next week.

Hi, @Gowtham_Rao. Thank you for initiating this effort. We (Jaerong Ahn and Hongfang Liu from UT Health Science Center) would like to participate in the AI arm and the challenge. Let me know if there are any next steps.


Thanks! I can be there in person and do RA for rheumatology, and curate any other rheum concepts as a clinical/domain expert.


I have added you to the list of experts. @Liz_Park @Christopher_Mecoli - we have two rheumatology experts. We are still finalizing the research design and I will follow up with details.


Ben Hamlin (IPRO)

  1. sepsis
  2. DVT

Update on the AI Arm: Confirmed Participants and Instructions for the “Minds Meet Machines” Phenotyping Challenge

Hello everyone,

Who is in the AI Arm?

The AI arm features a diverse range of participants, including academic researchers, technology startups, and industry teams who are leveraging AI/LLM-driven tools for concept generation.

We understand that some groups prefer discretion (e.g., to de-risk). Therefore, participants can choose to be publicly identified or participate anonymously. Anonymous participants will be assigned a private ID so they can track their tool’s performance relative to others and the human-led teams.

We currently have the following AI organizations participating:

  1. @jswerdel (J&J): Utilizing concept set generating software under development.
    • Format: Conforms to OMOP OHDSI Atlas concept set expression.
    • Presentation: Will present for 10 minutes.
  2. @darya.zhukova (Prometheus - EPAM):
    • Format: Conforms to OMOP OHDSI Atlas concept set expression.
    • Presentation: Will present for 10 minutes.
  3. Anonymous 01:
    • Format: Confirmed, but not sure if output can be provided in OMOP OHDSI Atlas form.
    • Presentation: Will not present.
  4. @Niko_Moller-Grell:
    • Format: Has a CLI (no UI/UX). Familiar with OMOP, the vocabulary, and concept set generation as defined in Atlas. Will try to develop a structured JSON concept set expression (currently the workflow produces a semi-structured list of concepts).

There are a few more holding back that I am trying to pull in.

General Instructions for the AI Arm

To ensure a fair, unbiased comparison and to evaluate the raw capabilities of the AI tools, the following standardized instructions have been provided to all AI participants:

1. The Input:
Participants will receive brief clinical descriptions (approximately 2 lines each).
Example: “Severe systemic lupus erythematosus is a condition that is associated with flares or severe manifestations
” These descriptions will be prepared in coordination with the clinical leads.

2. The Task:
Use your AI tool to process these descriptions and generate corresponding code lists.

3. The Output Format:
Provide one or more of your best concept sets for each clinical idea. The output must be in one of the following formats (a minimal example of the preferred format follows this list):

  • Preferred: An OMOP OHDSI Atlas Concept Set expression (i.e., a JSON we can import into ATLAS concept set expression).
  • Alternative: A list of standardized codes (SNOMED preferred, or ICD10CM).

4. The CRITICAL Rule—Integrity of Output:
The most vital condition for this study is that the output of the AI concept set/code list must not be modified by a human. No post-editing is allowed. We rely on the integrity of the participants to submit the honest, raw output of the AI tool.

5. Submission Deadline:
All AI-generated concept sets must be submitted by October 6th, 2025.
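
As referenced in item 3 above, here is a minimal, illustrative example of the preferred output format, an ATLAS concept set expression, built and serialized in Python. This shows only the approximate shape; please verify against a concept set exported from the workshop ATLAS instance, and replace the placeholder concept fields with the real OMOP concept(s) you intend to submit.

```python
import json

# Illustrative only: approximate shape of an ATLAS concept set expression.
# Verify against a real export from ATLAS; the concept fields below are
# placeholders, not a vetted concept.
expression = {
    "items": [
        {
            "concept": {
                "CONCEPT_ID": 0,                       # replace with the real OMOP concept id from ATHENA/ATLAS
                "CONCEPT_NAME": "Systemic lupus erythematosus",
                "STANDARD_CONCEPT": "S",
                "CONCEPT_CODE": "55464009",            # SNOMED code shown for illustration only
                "DOMAIN_ID": "Condition",
                "VOCABULARY_ID": "SNOMED",
                "CONCEPT_CLASS_ID": "Clinical Finding",
            },
            "isExcluded": False,
            "includeDescendants": True,                # include the hierarchy below this concept
            "includeMapped": False,
        }
    ]
}

print(json.dumps(expression, indent=2))
```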

Presentation and the Worksession

The AI-generated concept sets will be collected, standardized, and blinded before the live worksession. During the symposium, these sets will be adjudicated by human experts and compared against the outputs from the other study arms.

Participation Options:

  • AI participants can join the work session either in-person (allowing for active engagement) or virtually (which may be primarily observational).
  • Participants may choose to watch quietly or actively engage in the discussion.

Presenting:
The overall comparative analysis and results will be presented by the workgroup leads. Teams that have chosen to be identified are welcome to present or discuss their methodologies during the session (as noted above), but this is not required. They are not allowed to do a commercial/sales presentation.

We are looking forward to the valuable insights this rigorous experiment will bring to the OHDSI community.

Best regards,

Gowtham Rao
(On behalf of the Phenotype Development and Evaluation Work Group)

  1. Michael Riley, Georgia Tech Research Association
  2. I have experience working with OHDSI tools and providing MCP capabilities; I don’t have medical expertise, so I’m happy to fill in on any team.

on site or remote?

Remote
