
OHDSI Phenotype Phebruary 2024 and workgroup updates

OHDSI Phenotype Phebruary 2024


Phenotype Phebruary represents our community's collective effort to advance the field of phenotyping in observational studies, driven by a shared desire for continuous learning and improvement.

Working folder with files for Phenotype Phebruary 2024

Why Do We Conduct Phenotype Phebruary?

  • Community Engagement and Collaboration

    • Dedicated time for collaborative focus on phenotyping.
    • Fosters community engagement and collaboration.
    • Driven by the community’s interest and positive feedback from previous events.
    • Inclusive, involving a broad audience, from clinical investigators and software developers to data partners.
  • Advancement in Phenotyping Science

    • Aims to make progress in the field and science of phenotyping.
    • Enhancing the content of the OHDSI phenotype library.
    • Continues the progress made in previous years.
  • Education and Practice

    • Opportunities for education or training on phenotype development and evaluation.
    • Focus on training in the use of OHDSI tools.
    • Improvement of existing tools and practices in phenotyping.

What We Aim to Achieve in Phenotype Phebruary 2024

Key Components:

  1. Design and Manage a Collaborative Study

    • Month-long study focused on assessing consistency in phenotype definitions and methods.
    • Main goal: Evaluate reporting patterns and consistency among reported phenotype algorithms for the same clinical phenotype across observational studies.
  2. Weekly Activities

    • Four Weeks, Four Phenotypes: Focus on different phenotypes each week, selected via community feedback.
    • Activities per week:
      • a) Systematic Literature Search and Synthesis: Collaborative literature search for observational studies, abstract documentation, and report on phenotype algorithms in a structured format.
      • b) Replication Using Atlas and Conducting Network Study: Replicate phenotype definitions/algorithms using Atlas and run these definitions on a network of volunteer OHDSI data partners. Use tools like Cohort Generator, Cohort Diagnostics, and PheValuator. Compile outputs.
      • c) Results Review and Dissemination: Summarize variations in population characteristics like incidence rates.


  • Continuous Improvement

    • A month dedicated to community learning and communication improvement.
    • Development of a standardized template for reporting phenotype definitions, building on each week’s progress.
    • Draft a reporting template by month-end as a model for phenotype algorithm reporting.
  • Scheduled Activities and Milestones

    • Weekly progress reports in OHDSI Tuesday community calls.
    • Recorded working community meetings throughout the month.
    • Management of tasks with assignees and timelines.
  • Community Collaboration and Education

    • Emphasis on community collaboration, with opportunities for involvement detailed.
    • Open invitation for observation; active participation and leadership opportunities provided.
    • Virtual collaboration with recorded meetings.

Details of Collaborative Study

  • Phenotype Definition Components

    • Objective: To investigate the components of phenotype definitions as reported in scientific publications. We aim to identify and categorize the common components used in defining a phenotype algorithm, such as code lists, temporal rules, anchor dates, and look-back periods.
    • Community Insight: Summarize experiences from the community regarding the human interpretation of published materials and their components in phenotype definitions.
  • Phenotype Representations

    • Objective: Identify and describe the various methods researchers use to represent phenotype algorithms in studies. This representation could range from descriptive paragraphs and detailed tables to graphs or computer-interpretable code. Explore the inclusion of phenotype definition components, particularly focusing on structures describing temporal rules in relation to an anchor date.
    • Community Insight: Summarize community experiences in interpreting these representations in publications.
  • Phenotype Validation Methods

    • Objective: Assess the range and diversity of methodologies employed for phenotype validation, along with their interpretative aspects. This includes examining techniques, their details and implications, from chart reviews to computational methods like PheValuator. Evaluate the impact of these methodologies on the reliability and accuracy of phenotype validation.
    • Community Insight: Discussion and deliberation on value of such validation methods and outputs to inform study result interpretation.

How you can contribute

Get involved or lead a task described below. These tasks will be repeated in the same order every week throughout Phenotype Phebruary, allowing community members to contribute and make steady progress.

Details to be discussed in future calls. Please see this thread for meeting invites.

Task 1: Systematic Search

  • Perform a systematic search using a standardized template.
  • Compile a list of candidate papers to summarize.
  • Estimated commitment: 1.5 hours.
  • Individual activity and report out.

Task 2: Abstract Publications

  • Review identified literature from Task 1.
  • Populate an abstraction template for each paper (about 15 minutes per paper).
  • First, do a quick review of the article’s method section to identify algorithm details.
  • If there is no detail, mark as not having any information and skip.
  • Focus on code lists, phenotype algorithm rules, temporal rules, and performance characteristics.
  • Review and identify methods used for validation (e.g., chart review).
  • Estimated commitment: about 40 papers to abstract, each taking roughly 15 minutes.
  • Individual activity and report out.

Task 3: Summarize Definitions

  • Based on the literature review, identify how many phenotype algorithms are described for the clinical idea.
  • Report on commonalities and differences in identified definitions.
  • Decide which definitions to create Atlas definitions for.
  • Implement identified definitions in Atlas.
  • Estimated commitment: 4 to 6 hours.
  • Group activity with leads.

Task 4: Run Cohort Generator, Cohort Diagnostics and PheValuator

  • Run Cohort Diagnostics and PheValuator.
  • Compile output and analyze.
  • Examine whether definitions identify the same/similar/different individuals.
  • Summarize differences in population characteristics and incidence rates.
  • Provide opportunities to gain experience with OHDSI tools.
  • Estimated commitment: 10 to 12 hours. Time to run diagnostics, compile output, and analyze.
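The overlap question in Task 4 ("do the definitions identify the same/similar/different individuals?") can be sketched outside the OHDSI tool stack. The snippet below is a minimal, hypothetical Python illustration using invented person IDs, not Cohort Diagnostics output:

```python
# Toy sketch (not an OHDSI tool): compare the person sets produced by two
# cohort definitions to see whether they capture the same, overlapping,
# or different individuals. All person IDs are invented for illustration.

def cohort_overlap(cohort_a, cohort_b):
    """Return overlap counts and the Jaccard index for two sets of person IDs."""
    a, b = set(cohort_a), set(cohort_b)
    shared = a & b
    union = a | b
    return {
        "only_in_a": len(a - b),
        "only_in_b": len(b - a),
        "shared": len(shared),
        "jaccard": len(shared) / len(union) if union else 0.0,
    }

# Example: two hypothetical Alzheimer's definitions applied to the same data.
definition_1 = [101, 102, 103, 104, 105]
definition_2 = [103, 104, 105, 106]
print(cohort_overlap(definition_1, definition_2))
```

A Jaccard index near 1.0 suggests the definitions capture essentially the same population; values near 0 indicate largely disjoint cohorts.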

Task 5: Long-term Collaboration

  • Begin drafting a manuscript based on community experience.
  • Plan for February 28th and 29th.
  • Set the agenda for future work in 2024.
  • Discuss plans for the OHDSI 2024 global symposium.
  • Consider subgroup workshops, such as improving performance characteristics of cohort definitions.
  • Estimated commitment: Time for manuscript drafting and agenda planning.

Link to sign up form Phenotype Phebruary 2024 sign up



Orientation Meeting for Phenotype Phebruary 2024: Agenda and Details

This meeting will provide an overview and detailed instructions for the tasks ahead.

Date and Time: Next Monday (February 5th 2024), 1-hour session.
Meeting invite has been sent out to those who signed up. You can also join by using the link below.

The primary aim of this meeting is to orient participants on the tasks and activities planned for the Phenotype Phebruary event. We will focus on the community selected conditions: Alzheimer’s disease, lung cancer (non-small cell and small cell), major depressive disorder (MDD), and pulmonary arterial hypertension (PAH).


  1. Introduction and Objectives of Phenotype Phebruary.
  2. Overview of Selected Conditions.
  3. Detailed Discussion on Each Task:
    • Literature Search and Synthesis
    • Replication Using Atlas and Network Study
    • Results Review and Dissemination
  4. Demonstration using Alzheimer’s disease as an example.
  5. Q&A Session

Meeting Link: Join Microsoft Teams Meeting

Looking forward to a productive session and your valuable participation!


An email from @Azza_Shoaibi.

Note: The referenced links are in the OHDSI MS tenant, and you must be logged in to view them. As of today, you need access to the OHDSI MS tenant (freely available) to participate in Phenotype Phebruary. You also need to have signed up for the OHDSI Phenotype Development and Evaluation workgroup here


It’s February 1st, and Phenotype Phebruary has officially kicked off! Thank you all for your interest!

We’re thrilled to have 45 collaborators who have signed up to participate in this exercise. This is incredibly awesome!

Last Tuesday, our community selected the following four conditions to include in the study:

  • Alzheimer’s disease
  • Lung cancer (non-small cell and small cell, studied separately)
  • Major depressive disorder (MDD)
  • Pulmonary arterial hypertension (PAH)

So let’s get started!

For each of the 4 conditions, we will undertake these sequential tasks:

A) Literature Search and Synthesis:

  • a.1. Start by identifying studies that reported or implemented phenotype definitions for the clinical condition of interest using the instructions found here: [Step-by-Step Instructions for PubMed Search on Phenotype Definitions]. Save the search results in the designated folder on Teams.
  • a.2. Abstract the phenotype definition specification into the literature review abstraction template. A copy of the template is available here: [Lung cancer definitions.xlsx]
  • a.3. Copy and paste the phenotype section into the phenotype representation template. A copy of the template is available here: [Phenotype Representation.docx]
  • a.4. Summarize differences in phenotype definitions and representation (to be done in a group session).

B) Replication Using Atlas and Conducting Network Study:
Replicate phenotype definitions/algorithms using Atlas and run these definitions on a network of volunteer OHDSI data partners. Utilize tools like Cohort Generator, Cohort Diagnostics, and PheValuator. Compile outputs.

C) Results Review and Dissemination:
Summarize variations in population characteristics, such as incidence rates.
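The incidence rates summarized in step C are typically expressed per 1,000 person-years. As a hedged illustration (invented numbers, not study results), the basic calculation looks like this:

```python
# Toy sketch of the kind of summary produced in step C: an incidence rate
# expressed as cases per 1,000 person-years. The counts are invented.

def incidence_rate_per_1000(cases, person_years):
    """Incidence rate per 1,000 person-years of follow-up."""
    if person_years <= 0:
        raise ValueError("person_years must be positive")
    return 1000.0 * cases / person_years

# Example: 12 hypothetical new diagnoses over 4,800 person-years of follow-up.
rate = incidence_rate_per_1000(cases=12, person_years=4800)
print(f"{rate:.2f} per 1,000 person-years")  # 2.50 per 1,000 person-years
```

Comparing this rate across definitions and data partners is one way to surface the variation the study is looking for.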

All materials and templates are in the following folder: Phenotype Phebruary 2024.

A group of us has completed the first task (literature search and synthesis) for Alzheimer’s disease. We will have an orientation on these tasks using Alzheimer’s disease as our example on Monday at 11:30 AM Eastern Time. Please join the Monday call for clarity on tasks and timelines. The call will be recorded for those unable to attend. The details are here.

For those interested in joining the Alzheimer’s disease phenotype work, our next steps are:

  • Summarize findings on reporting, validation, similarities, and differences in the algorithm: This will be through a small group discussion next week (final time to be announced).
  • Around 5 cohorts will be replicated from the literature to identify patients with AD: In the phenotype working group meeting on Friday, we will demo the replication of 2 of the 3 cohorts and review cohort diagnostics results.


These are the initial instructions on how to perform a systematic search for phenotype algorithms. As part of Phenotype Phebruary 2024, we look forward to community input to improve these search criteria and the process.

Step-by-Step Instructions for PubMed Search on Phenotype Definitions

Example: Alzheimer’s Disease

1. Accessing PubMed:

2. Using MeSH Terms:

  • Under the search field, select ‘Advanced’.
  • Choose ‘MeSH Terms’ from the dropdown in the ‘Add terms to query box’.
  • Enter your clinical condition of interest, e.g., ‘Alzheimer Disease’, and press ‘Add’.

3. Entering Search Terms:

  • Now, select ‘All Fields’ from the dropdown.
  • Copy and paste the following text into the ‘Enter search term’ box and press ‘Add’:
        (retrospective cohort) 
        OR (epidemiology [MeSH Terms])
        OR (Epidemiologic Methods[MeSH Terms])
        OR (phenotype[Text Word])
        OR (Validation Study[Publication Type])
        OR (positive predictive value[Text Word])
        OR (Validation Studies as Topic[MeSH Terms])
        OR (Sensitivity and Specificity[MeSH Terms])
        OR (insurance OR claims OR administrative OR health care)
        OR database
        OR algorithm

        OR (Medicare)
        OR (Truven)
        OR (Optum)
        OR (Medstat)
        OR (Nationwide Inpatient Sample)
        OR (National Inpatient Sample)
        OR (PharMetrics)
        OR (PHARMO)
        OR (ICD-9[Title/Abstract])
        OR (ICD-10[Title/Abstract])
        OR (IMS[Title/Abstract])
        OR (electronic medical record[Text Word])
        OR (Denmark/epidemiology[MeSH Terms])
        OR (Veterans Affairs[Title/Abstract])
        OR (Premier database[Title/Abstract])
        OR (Database Management System[MeSH Terms])
        OR (National Health Insurance Research [MeSH Terms])
        OR (administrative claims[Text Word])
        OR (General Practice Research Database[Text Word])
        OR (Clinical Practice Research Datalink[Text Word])
        OR (The Health Improvement Network[Text Word])

        "Clinical Trial"[pt] 
        OR "Editorial"[pt] 
        OR "Letter"[pt] 
        OR "Randomized Controlled Trial"[pt] 
        OR "Clinical Trial, Phase I"[pt] 
        OR "Clinical Trial, Phase II"[pt] 
        OR "Clinical Trial, Phase III"[pt] 
        OR "Clinical Trial, Phase IV"[pt] 
        OR "Comment"[pt] 
        OR "Controlled Clinical Trial"[pt] 
        OR "Letter"[pt] 
        OR "Case Reports"[pt] 
        OR "Clinical Trials as Topic"[Mesh] 
        OR "double-blind"[All] 
        OR "placebo-controlled"[All] 
        OR "pilot study"[All] 
        OR "pilot projects"[Mesh] 
        OR "Prospective Studies"[Mesh]
        OR "Genetics"[Mesh]
        OR "Genotype"[Mesh]
        OR (biomarker[Title/Abstract])

  • Click ‘Search’.

4. Narrowing Down Results:

  • If the results are too numerous (>100), restrict the search to the last 4 years.
  • Record the number of remaining studies for review.

5. Abstract Review:

  • Review abstracts for relevance.
  • Dismiss studies that meet any of these exclusion criteria: non-original research, non-observational studies, different diseases or outcomes, methods not indicating use of a study definition.

6. Selecting Studies:

  • Mark relevant studies.
  • Record the number of remaining studies (expected to be up to 50% of the initial studies).
  • Include original validation studies if referenced: if a study mentions using a validated definition from an earlier study, trace back to the original validation study (even if it falls outside the calendar window covered by the search).

7. Saving Your Search:

  • Save your selected studies.
  • You can save them to Clipboard for 3-day access, email the results, or export them to a citation manager.
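The manual steps above can also be scripted against NCBI's E-utilities `esearch` endpoint. The sketch below is a hedged illustration: it only assembles the query URL (no request is sent), and it uses just a few of the filter terms from the full query above:

```python
# Sketch of assembling a PubMed esearch URL for the search described above.
# Only a small excerpt of the full filter list is shown; the URL is built
# but never fetched here.
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(condition_mesh, filter_terms):
    """Combine a MeSH condition with OR'd filter terms into an esearch URL."""
    filters = " OR ".join(filter_terms)
    term = f"({condition_mesh}[MeSH Terms]) AND ({filters})"
    return ESEARCH + "?" + urlencode({"db": "pubmed", "term": term, "retmax": 200})

url = build_esearch_url(
    "Alzheimer Disease",
    ["phenotype[Text Word]",
     "Validation Study[Publication Type]",
     "administrative claims[Text Word]"],
)
print(url)
```

Fetching this URL would return an XML list of PMIDs, which could then feed the abstraction templates; for the actual study, following the step-by-step PubMed UI instructions keeps everyone's searches identical.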

Meeting recording available here on MS Teams and YouTube.

Notes synthesis by GPT

The meeting transcript is from an orientation session for Phenotype Phebruary 2024, an initiative by the OHDSI community. Key points include:

  1. Purpose and Participants: The meeting, led by Azza Shoaibi of Johnson & Johnson, focuses on starting the Phenotype Phebruary activities. It involves participants like Anna, Jamie Weaver, Asia, and Gowtham, coordinating the month’s work.

  2. Phenotype Working Group: Attendees, who had signed up on the OHDSI forum, were added to the phenotype working group for document sharing and collaboration.

  3. Goals and Focus: The initiative aims to understand practices in phenotype development and evaluation, specifically looking at heterogeneity in definitions, representations, and evaluation methods in observational studies from the last four years.

  4. Conditions for Study: Four conditions were chosen based on community voting - Alzheimer’s disease, small cell and non-small cell lung cancer, major depressive disorder, and pulmonary arterial hypertension. Specific definitions for these conditions were provided.

  5. Methodology: The approach includes literature review, replication of definitions using the Atlas tool and Cohort Diagnostics, and evaluating phenotype definitions using PheValuator. Collaborators with patient data access are encouraged to participate.

  6. Dissemination and Documentation: Results will be shared in weekly updates and a final paper. Participants are encouraged to assist in drafting the manuscript, with an immediate start on the introduction and methods sections.

  7. Literature Review Tasks: Detailed guidance was provided on conducting literature reviews, abstracting relevant information, and documenting phenotype definitions from selected papers.

The meeting emphasizes collaboration, detailed methodology, and the importance of clear definitions and evaluations in phenotype research.

Azza provided detailed guidance on conducting the literature review for the Phenotype Phebruary 2024 project:

  1. Purpose and Scope: The literature review is necessary for all four conditions being studied. Approximately 20-30 people are expected to participate in the review for each condition.

  2. Use of Templates: Two documents are central to the review process. The first is the literature review abstract form, which participants will use to go over specific details. The second is an open Word document template called “phenotype representation,” in which participants will copy the entire text or any supplementary material from the selected papers.

  3. Selecting and Saving Papers: Participants are to select papers after screening out irrelevant ones. They should use the ‘save and share’ feature in PubMed to download a text file that includes the selected papers. This file should be saved in a designated subfolder, and its name will reflect the search strategy used.

  4. Search Strategy and Criteria: A step-by-step document has been created to guide the literature search in PubMed. The task involves finding papers, filtering them, and removing those that are not relevant, aiming to end up with a list of 20-30 papers published in the last four years. These papers should include phenotype definitions, specifically for conditions like Alzheimer’s.

  5. Access and Sharing Policies: Many papers are expected to be open access. However, if a paper is not open access, participants are advised not to share anything that violates copyright or access agreements.

This structured approach aims to ensure thorough, standardized, and compliant literature review practices for the Phenotype Phebruary 2024 project.

Azza provided specific instructions on what to look for in each article during the literature review for the Phenotype Phebruary 2024 project:

  1. Study Identification: Participants should identify each study using its unique study number, which is mentioned in the literature review abstraction sheet. Keeping track of the article number is crucial for organizing and referencing the studies properly.

  2. Information Abstraction: Reviewers are asked to abstract information using a structured template focused on the phenotype. Additionally, they are to assess the representation of these phenotypes by copying and pasting relevant paragraphs discussing the phenotype into another Word document. This document will later be reviewed using qualitative methods to identify heterogeneity in reporting or representation.

  3. Use of Templates: Two specific documents are involved in the abstraction process. The first is a structured abstraction template, and the second is an unstructured document where reviewers copy the relevant text from the papers. After completing this step, there will be a group discussion to analyze these texts.

  4. Review and Analysis Process: Once the literature review is complete and the relevant texts are gathered, a table will be created for further analysis. Participants are encouraged to sign up to review one or two studies, based on their availability and capability. It’s important that those signing up for this task have access to the full text of the articles.

These steps are designed to ensure a comprehensive and detailed analysis of the literature, focusing specifically on the representation and definitions of phenotypes in the selected studies.

Here are the audience questions and answers reframed as FAQs for the Phenotype Phebruary 2024 project:

Q1: How should we communicate and ask questions during the project?

  • A1: Please use Microsoft Teams for all your queries and communications during the project. It is the recommended platform as it allows everyone to see the discussions and responses, ensuring transparency and collective knowledge sharing.

Q2: What if I don’t have access to the Phenotype Phebruary 24 folder on Teams?

  • A2: If you find that you do not have access to the necessary folder on Teams, please reach out to Azza Shoaibi, Gowtham, or Anna for assistance. They will ensure you get the access needed to participate effectively in the project.

Q3: Is it better to post questions publicly or contact someone directly for help?

  • A3: We encourage posting your questions publicly on Teams for everyone to benefit from the shared information. However, if you prefer one-on-one communication, feel free to directly reach out to Azza Shoaibi for Alzheimer’s related issues or to Anna for other matters.

Q4: How can I sign up for the literature review and abstracting information?

  • A4: Once the studies are ready, you will have the opportunity to sign up for conducting the literature review and abstracting information. Keep an eye on the updates and instructions on Teams for when and how to sign up.

Q5: What is the process for participating in replication tasks using Atlas?

  • A5: If you’re interested in replication tasks, we will be using the Atlas demo, which is available to all. A task sheet will be created with papers that are eligible for replication. You can sign up for replicating a study from this list. For these tasks, you will utilize the information provided in the literature review abstract, including codes, logic, and links, to conduct the replication in Atlas.

Q6: How should we document the assumptions made in building out Atlas phenotypes from the literature?

  • A6: When converting phenotypes from literature to Atlas, it’s understood that different individuals may interpret and implement this in slightly different ways due to varying levels of granularity in the original literature. Patrick Ryan suggested that participants should primarily copy the free text from the literature. The second tab in Atlas is used mainly for explicitly provided structured codes, and other descriptions should just be documented as they are. This approach acknowledges the inherent variability in interpreting literature into structured data formats like Atlas.

Upcoming working meeting (this is not a presentation). In this working meeting we will work on assessing the differences in Alzheimer’s disease definitions used in observational studies (with a focus on formatting/representation).

The working meeting is open to all.

Click here to join the meeting

2/6/2024 1pm to 2pm EST

Recap from OHDSI Community Call February 6th 2024

The speakers each discussed different aspects of the progress made in Phenotype Phebruary 2024:

  • Anna Ostropolets: Highlighted the work done on Alzheimer’s Disease, focusing on the diversity in definitions and methodologies used in different studies.
  • Gowtham Rao: Discussed the evolution of terminology and its impact on clinical understanding and research in Alzheimer’s Disease.
  • Azza Shoaibi: Talked about the diversity in Alzheimer’s Disease algorithms, emphasizing the challenges in reproducing research findings.
  • Jamie Weaver: Spoke about applying new diagnostics and calculating incidence rates for Alzheimer’s Disease, and the development of a new study package.

Notes synthesis by GPT

Key discussion

  1. Importance of Detailed Reporting: There’s a consensus on the necessity for papers to provide comprehensive details about phenotypes. This includes methods, limitations, any validation metrics, and use of supplementary materials like codes or algorithms.

  2. Challenges in Phenotype Representation: Difficulty in representing phenotypes accurately due to varied data types, demographics, and diagnostic criteria. The discussion highlights the trade-offs between conveying a gestalt of the idea to having specificity to allow replication in phenotype representation.

  3. Standardization Need: There’s a discussion on the need for a standardized representation of phenotypes in publications. This would help in replicating studies and ensuring reproducibility.

  4. Data Types and Demographics: The types of data used, such as ICD codes, drugs, and cognitive screening tests, and how demographic definitions, like age ranges, impact the study outcomes.

  5. Representation Formats: While most papers use text for phenotype representation, there’s mention of visual representations in some cases, particularly for complex algorithms or when phenotype validation is the paper’s main focus.

Overall, the conversation underscores the importance of clarity, detail, and standardization in reporting phenotype algorithms to enhance reproducibility and accuracy in scientific research.

@Azza_Shoaibi shared several thoughts on phenotype representation in scientific publications:

  1. Justification of Phenotype Definitions: Shoaibi emphasized the importance of papers with a section on phenotype definition or methods, explaining how the choice of phenotype definition is justified by the clinical question being asked. She noted a gap where many papers do not explicitly state why a particular phenotype algorithm was chosen, often just taking what is available without considering the operating characteristics required by their research question. She described the current practice as mainly text-based descriptions in manuscripts, suggesting a need for more systematic documentation and qualitative analysis of this information.

  2. Replication: In her research, she focused on replication, accessing any available supplementary materials. Shoaibi pointed out that none of the papers she reviewed included a full phenotype algorithm in a computer language.

  3. Goal for Community Recommendations: She expressed a desire for the community to develop recommendations for researchers on important and necessary information or specifications to include in manuscripts, focusing on representation, formatting, and the level of detail provided.

@hripcsa expressed specific expectations for what he would like to see in a paper on phenotype:

  1. Comprehensive Information: He wants the paper to clearly outline all necessary information about the phenotype algorithm that a scientist should report. It should provide details on which elements should be included in the main body of the paper and which should go in the supplement.

  2. Easy-to-Understand List with justification: George wishes to see an easily digestible list in the paper. This list should provide straightforward information about the phenotype. He also wants evidence showing the necessity and usefulness of this list, ensuring that it is not merely filling space but is a valuable component of the paper.

@Patrick_Ryan shared several insights on phenotype representation in scientific publications:

  1. Clarity in Phenotype Communication: He expressed concern about the lack of clear intentions in phenotype descriptions across different submissions. He found it challenging to understand what the authors were trying to achieve with their phenotype descriptions.

  2. Value of Comprehensive Data: Patrick highlighted a particular table, the “Superset algorithm spreadsheet,” shared by David as compelling. This suggests his appreciation for detailed and comprehensive data representation in the context of phenotype algorithms.

  3. Need for Reproducibility and Clarity: He struggled with the hybrid nature of some phenotype descriptions, which seemed to him neither a clear clinical gestalt nor a reproducible list. He emphasized the importance of having phenotype representations that are either clearly conceptual or aimed at reproducibility, rather than a confusing mix of both.

  4. Accessibility of Information: He discussed the difficulties of accessing and utilizing phenotype information when it is only available through a Git repository and a JSON file. This format, while potentially useful for a niche group like the OHDSI community, may be too complex for a broader audience, underlining the importance of making phenotype data accessible and understandable to a wider range of researchers.

Great discussion today. Just to carry over some of the concepts of the meeting and post meeting chat.

Code counts may help align various published but incompletely documented algorithms; concerns arise when the codes used do not line up. (From @Patrick_Ryan) Two additional pieces of information per code would help: 1) the % of the cohort that has the code, and 2) the % of the cohort that would be lost if the code was removed from the codelist. That doesn’t address all permutations, but it might suffice to avoid spending time debating whether or not to include a particular code.
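Patrick's two per-code metrics can be illustrated with a toy example. The Python sketch below uses invented persons and ICD-10 codes, not real claims data; the cohort is simply anyone with at least one code in the codelist:

```python
# Toy sketch (hypothetical data) of the two per-code metrics suggested above:
# 1) the % of the cohort that has the code, and
# 2) the % of the cohort lost if the code were removed from the codelist.

def code_metrics(person_codes, codelist):
    """person_codes: {person_id: set of codes}. Cohort = anyone with a listed code."""
    cohort = {p for p, codes in person_codes.items() if codes & codelist}
    n = len(cohort)
    metrics = {}
    for code in codelist:
        has_code = {p for p in cohort if code in person_codes[p]}
        rest = codelist - {code}
        kept = {p for p in cohort if person_codes[p] & rest}
        metrics[code] = {
            "pct_with_code": 100.0 * len(has_code) / n,
            "pct_lost_if_removed": 100.0 * (n - len(kept)) / n,
        }
    return metrics

# Four hypothetical persons and a two-code Alzheimer's-style codelist.
persons = {1: {"G30.9"}, 2: {"G30.9", "F02.80"}, 3: {"F02.80"}, 4: {"G30.9"}}
print(code_metrics(persons, {"G30.9", "F02.80"}))
```

The two numbers differ whenever persons carry more than one code: a code can appear in most of the cohort yet cost few members if dropped, which is exactly the information that would short-circuit codelist debates.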

Also wanted to add that some phenotype documentation may not be in PubMed; it may instead be in our OHDSI repos or other CDM repos. Here is a Sentinel example of Alzheimer’s disease coding trends in CCAE that cross the ICD-9 to ICD-10 era.

On February 9th 2024 the OHDSI Phenotype Development and Evaluation workgroup will do

  1. A demonstration of building cohort definitions using Atlas tool - for Alzheimer’s disease and related dementias (AD/ADRD)
  2. We will then review 2 selected cohort definitions that were replicated from the literature
  3. We will review some population level characteristics using Cohort Diagnostics (link to be provided)

The workgroup meets twice a month, on the second and fourth Fridays at 9am EST. Please join the workgroup to have the meetings added to your calendar.

You can also join the meeting using this link

Slides for the call
All content is located here

Notes synthesis by GPT

Patrick Ryan’s Demonstration:

  1. Accessing Atlas:

  2. Chronic Condition Warehouse (CCW) Algorithm:

    • Discussed the transition from ICD9CM to ICD10CM in October 2016.
    • Demonstrated building the algorithm in Atlas, starting with copying ICD codes and searching in Atlas.
    • Mapped ICD9CM to standard concepts in SNOMED, specifically “Alzheimer’s disease.”
    • Created a concept set (27 CCW Alzheimer’s Disease - 1883045) and analyzed concept occurrence counts.
  3. Building Cohort Definitions:

    • Replicated the 27 CCW definition in Atlas, ensuring all relevant codes were included.
    • Modified the cohort for the 30 CCW version, adding inclusion criteria for diagnosis during inpatient visits or multiple diagnoses.
    • Demonstrated using OHDSI visit definitions for inpatient criteria and overlapping periods.
  4. Third Algorithm Implementation:

    • Aimed to complete in about 10 minutes, focusing on new diagnosis and drug concept sets.
    • Imported codes from a PDF, removed placeholders, and filtered results.
    • Mapped non-standard codes to standard concepts, demonstrating vocabulary navigation and decision-making regarding broader terms.
    • Built drug concept sets, searching for specific drugs and selecting relevant ATC hierarchy categories.
    • Constructed cohort definitions considering diagnoses, drugs, and temporal logic.


  • Successfully implemented three algorithms.
  • Patrick handed over to Azza and Anna for further discussion.

Patrick Ryan provided valuable tips on mapping non-standard to standard codes in Atlas, especially when dealing with a large number of descendants. Here’s a summary of his approach:

  1. Selecting Relevant Concepts: When dealing with non-standard codes, such as ICD 9 or ICD 10, Patrick emphasized the importance of identifying and selecting the most relevant standard concepts they map to. This involves examining each non-standard code individually to understand its mapping.

  2. Handling Large Descendant Counts: A significant challenge arises when a non-standard code maps to a standard concept with a large number of descendants. Patrick demonstrated a careful approach to this issue. He suggested examining the specific source codes that map directly into the broader standard concept and its descendants. This step is crucial to ensure that the inclusion of a standard concept with many descendants doesn’t inadvertently include unrelated conditions.

  3. Making Informed Choices: Patrick stressed the need for making informed choices about whether to include all descendants of a concept. In cases where a non-standard code maps to a broader term with many descendants, he highlighted the risk of including too many records, some of which might be irrelevant. His approach was to select only the parent concept and not its descendants in such situations.

  4. Practical Verification: To verify the mapping, Patrick recommended using the concept set browser in Atlas. This tool helps to understand how the vocabulary rolls up and corresponds to records. It’s also useful for identifying any potential vocabulary distortion that occurs due to standardization.

  5. Case Example: In one instance, Patrick encountered a non-standard code that mapped to a broad term like ‘Dementia’. The term had a large number of record counts, including many descendants. To avoid including an excessively broad range of conditions, he chose to select the parent concept (‘Dementia’) without its descendants.

Patrick’s approach highlights the importance of careful examination and selection when mapping non-standard codes to standard ones in Atlas, particularly in the context of large descendant counts. This methodology ensures that the created concept sets are both accurate and clinically relevant.
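The descendant-count check described above can be sketched in code. The snippet below is an illustrative sketch only: the two dictionaries are tiny in-memory stand-ins for the OMOP `concept_relationship` (“Maps to”) and `concept_ancestor` tables, and the concept IDs and descendant sets are assumed for the example, not a vetted vocabulary extract.

```python
# Stand-in for concept_relationship "Maps to": source code -> standard concept_id
# (IDs below are illustrative, not verified against a vocabulary release)
maps_to = {
    "ICD9CM:331.0": 378419,   # maps to a narrow "Alzheimer's disease" concept
    "ICD9CM:290.0": 4182210,  # maps to a broad "Dementia" concept
}

# Stand-in for concept_ancestor: standard concept_id -> its descendant concept_ids
descendants = {
    378419: {378419, 4043378},                              # few, closely related
    4182210: {4182210, 378419, 373179, 4043378, 4009647},   # many, broad
}

def review_mapping(source_code, max_descendants=3):
    """Return the mapped standard concept, its descendant count, and whether
    including descendants looks safe; a large count flags the 'select parent
    only, exclude descendants' decision discussed above."""
    std = maps_to[source_code]
    n = len(descendants[std])
    return std, n, n <= max_descendants

std, n, ok = review_mapping("ICD9CM:290.0")
print(std, n, ok)  # broad 'Dementia'-style concept: consider parent only
```

In a real vocabulary the counts come from querying `concept_ancestor`; the point is only that the include-descendants decision should be made per concept, not globally.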

Additional tips:

  1. Starting with Simple Cases: Patrick started with the ‘nicer’ case of the CCW algorithm, showing a straightforward mapping process. He then moved on to more complex cases, like the Harris algorithm, which involved additional challenges like excluding specific codes (e.g., Lewy body) and dealing with codes that couldn’t be cleanly excluded due to mapping limitations.

  2. Handling Complex Mappings: In real-world situations, replicating a paper’s algorithm might require decisions about using standard concepts that are close but not perfect matches or implementing a source-code-based definition. This is especially relevant when the source vocabulary differs.

  3. Dealing with Shorthand and Vocabularies: Patrick criticized the use of shorthands like ‘ICD 290.2x’ because they can obscure the specific codes covered. He recommended explicitly listing all concepts and source codes with their descriptions to avoid inference and gaps in understanding.

  4. Directional Mappings in OHDSI Vocabularies: The OHDSI vocabularies create hierarchical relationships where standard concepts should be either synonymous or broader than the source codes mapping into them. Patrick noted that while most mappings from ICD to SNOMED were 1:1, there were exceptions where different ICD codes mapped to the same standard concept, making it challenging to differentiate them.

  5. Verifying Mapping Coverage: It’s essential to confirm that a concept set covers all relevant codes from both ICD9CM and ICD10CM. Patrick demonstrated this by building a concept set and ensuring it included all necessary codes.

  6. Verification and Problem Solving: When replicating a paper, it’s crucial not to blindly follow mappings but to verify the included source codes. If problems are found (like unwanted codes being included or missing codes), steps should be taken to correct these issues.

  7. Searching for Clinical Ideas: A more straightforward approach is searching for the clinical concept (e.g., ‘Alzheimer disease’) directly in SNOMED and including relevant descendants, excluding those not fitting the clinical idea. However, when replicating papers, Patrick used their codes but also emphasized building the best standard concept set possible, using SNOMED to navigate concepts and descendants.
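Tip 5 above (verifying mapping coverage) comes down to a set comparison between the codes a paper lists and the source codes a concept set actually resolves to. A minimal sketch, assuming hypothetical code lists (these are not a reviewed Alzheimer’s code set):

```python
# Codes listed in the paper being replicated (illustrative only)
paper_codes = {"331.0", "G30.0", "G30.1", "G30.8", "G30.9"}

# Source codes reachable from the concept set via "Maps to" (illustrative only)
concept_set_source_codes = {"331.0", "G30.0", "G30.1", "G30.9", "F02.80"}

missing = paper_codes - concept_set_source_codes      # gaps to close
unexpected = concept_set_source_codes - paper_codes   # possibly unwanted codes

print(sorted(missing))     # codes the concept set fails to cover
print(sorted(unexpected))  # codes pulled in that the paper never listed
```

An empty `missing` set gives coverage of the paper's codes; anything in `unexpected` is a candidate for the per-code review described in tip 6.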

Notes synthesis by GPT

  1. Introduction and Setup:

    • Azza begins by sharing the URL for Cohort Diagnostics: https://results.ohdsi.org/app/16_PhenotypePhebAlzh
    • She mentions running the algorithms on approximately 7 JNJ data sources.
    • A brief orientation of the Cohort Diagnostics tool is provided, highlighting the navigation and dropdowns on the left side.
  2. Recap of Previous Work and Initial Observations:

    • Azza recaps Patrick’s work and discusses the differences between 30 CCW and 27 CCW cohort definitions.
    • She demonstrates the impact of these definitions on patient counts across various databases, noting significant variations.
  3. Incidence Rate Analysis:

    • Azza selects a data source (DoD) and analyzes the incidence rate stratified by age, gender, and calendar year.
    • She simplifies the analysis by excluding extreme ages and data without gender, focusing on the age group at risk for Alzheimer’s disease.
  4. Index Event Breakdown:

    • The breakdown of entry events is examined, showing the distribution of codes allowing individuals into the cohort.
    • Azza discusses the implications of including non-standard codes, such as unspecified dementia.
  5. Comparison of Cohort Characteristics:

    • The demographics of different cohort definitions (Harris vs. CCW 27) are compared.
    • Observations are made regarding the age and gender distribution across these cohorts.
  6. Additional Analysis and Conclusion:

    • Azza mentions the possibility of extending this comparative strategy to other domains such as visits, disease, and drug data.
    • She concludes by emphasizing the importance of selecting the right algorithm for study results.
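The stratified incidence-rate view walked through above is, at its core, new cases over person-time within each stratum. A minimal sketch with invented tallies (the numbers are illustrative, not drawn from any of the databases above):

```python
# (age_group, gender) -> (incident cases, person-years); hypothetical tallies
strata = {
    ("65-74", "F"): (120, 40_000),
    ("65-74", "M"): (95, 35_000),
    ("75+", "F"): (310, 25_000),
}

def incidence_per_1000py(cases, person_years):
    """Incidence rate expressed per 1,000 person-years."""
    return 1000 * cases / person_years

for (age, sex), (cases, py) in strata.items():
    print(age, sex, round(incidence_per_1000py(cases, py), 2))
```

Excluding extreme ages and records without gender, as in the walkthrough, simply removes strata before this calculation.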

Azza shared the following key tips and strategies for interpretation:

  1. Algorithm Selection and Cohort Definition Impact:

    • Understand the differences between various cohort definitions (e.g., CCW 30 vs. 27 CCW vs. Harris) and how they impact patient counts and characteristics. More restrictive algorithms like CCW 30 might yield smaller, more specific cohorts.
  2. Analyzing Incidence Rates:

    • Pay attention to incidence rates across different databases and stratifications (e.g., age, gender). Adjusting for factors like age and gender can simplify analysis and provide more relevant insights.
  3. Index Event Breakdown:

    • Examine the distribution of codes that allow individuals into the cohort. This helps in understanding which entry criteria or codes contribute most to cohort formation.
  4. Use of Standard and Non-Standard Codes:

    • Investigate both standard and non-standard codes in cohort entry criteria. This can reveal if any unwanted codes are being included, affecting the cohort’s composition.
  5. Comparative Cohort Analysis:

    • Compare demographics and other characteristics across different cohort definitions to understand variations and similarities. This includes examining the distribution of age, gender, and potentially other factors like disease or treatment types.
  6. Evaluating Algorithm Stability Over Time:

    • It’s crucial to ensure the stability of the algorithm, especially in studies spanning multiple years. Changes in coding systems or treatment guidelines over time can affect algorithm performance.
  7. Nested Cohort Analysis:

    • When dealing with nested cohorts, focus on patients who enter one cohort but not a more specific one. This helps in diagnosing the effectiveness of each cohort definition and understanding the overlap and differences between them.
  8. Marginal Population Analysis:

    • In cases of significant overlap between algorithms, analyze the ‘marginal population’—those included in one algorithm but not the other. This comparison helps identify true or false positives and understand the unique contributions of each algorithm.
  9. Distribution Analysis for Demographic Consistency:

    • Analyze the distribution of demographics like age and gender to ensure consistency across cohorts. This helps in identifying if a cohort is inadvertently skewed towards a particular demographic group.
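Tips 7 and 8 above (nested cohorts and the ‘marginal population’) can be illustrated with plain set operations. The patient IDs, cohort memberships, and ages below are invented for illustration:

```python
# Hypothetical cohort memberships: the stricter definition is nested in the broader one
cohort_ccw27 = {1, 2, 3, 4, 5, 6}   # broader definition
cohort_ccw30 = {2, 4, 6}            # stricter, nested definition
age = {1: 68, 2: 81, 3: 72, 4: 79, 5: 70, 6: 84}

# The marginal population: enters the broad cohort but not the strict one
marginal = cohort_ccw27 - cohort_ccw30

def mean_age(ids):
    """Mean age of a set of patient IDs, for demographic comparison."""
    return sum(age[i] for i in ids) / len(ids)

print(sorted(marginal))              # who is only in the broader cohort
print(round(mean_age(marginal), 1))  # compare against mean_age(cohort_ccw30)
```

Comparing characteristics of `marginal` against the nested cohort is what helps judge whether the broader algorithm is adding true cases or noise.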

[10:56 AM] Shoaibi, Azza (Guest)

Phenotype Phebruary 2024 Update

Hi all, I would like to give an update on the activities.

Alzheimer’s Disease

  • Task Completed: Studies identified, Literature Review, Replication of definitions in Atlas, Review in Cohort Diagnostics, Phevaluator Run.

  • Next Step: @Azza_Shoaibi will initiate a document to summarize findings. This document will be open for contributions.

  • Further Action: @jweave17 to share a package of Cohort Diagnostics (CD) and incidence rate, including all Alzheimer’s cohorts, for other data partners to execute.

Lung Cancer

  • Task Completed: Studies identified.

  • Current Status: Literature review ongoing. @agolozar, could you please guide contributors to the identified studies to help with abstraction?

Major Depressive Disorder

  • Task Completed: Studies identified.

  • Current Status: Literature review yet to start.

  • Next Step: @aostropolets, could you please provide the list of studies for literature abstraction when ready? @aostropolets might also start a group chat for people who signed up for MDD.

Pulmonary Arterial Hypertension

@agolozar I’m trying to keep up this week but I did note this article on NSCLC out of South Korea that did use the OHDSI CDM work published today: Epidemiology and Outcomes of Non–Small Cell Lung Cancer in South Korea

Today at 9am EST @agolozar and members of the Phenotype Phebruary 2024 team will lead a conversation on the work done so far on Non-Small Cell Lung Cancer.

Relevant files

  1. Structured abstraction of phenotype algorithms on NSCLC
  2. Initial search results

Key summary (notes synthesis by GPT)

  1. Complexities of Defining NSCLC and SCLC: The meeting highlighted the challenges in defining and categorizing non-small cell lung cancer (NSCLC) and small cell lung cancer (SCLC). Asieh Golozar emphasized that the term NSCLC encompasses many distinct diseases, each with its unique characteristics, impacting treatment and patient outcomes.
  2. Evolving Treatment Paradigms: Asieh also noted the dynamic nature of lung cancer treatments, such as the introduction of checkpoint inhibitors, and how these changes necessitate continuous updates in research categorizations and definitions.
  3. Phenotyping and Data Sources: Azza Shoaibi discussed the focus on subgroups in lung cancer research, the importance of data source selection (pathology/histology vs. claims/EHR data), and the challenges of heterogeneity in coding practices and logic within studies.
  4. Specialization in Lung Cancer Research: Gowtham Rao highlighted that research tends to focus on specific subtypes of NSCLC, usually in the context of treatment, rather than on NSCLC as a standalone subject. He also mentioned the categorization of NSCLC into different stages and the concept of phenotype as a reusable asset, requiring continuous iteration and improvement.
  5. Data Source Reliability and Study Inclusion Rules: Gowtham Rao summarized that out of the reviewed studies, a portion focused on SCLC, and the majority on NSCLC. He noted the reliance on various data sources, with some studies using histology and pathology data and many depending on generic claims and EHR data sources. Even in studies without specific histology/pathology data, lung cancer is often identified through ICD 9 and ICD 10 codes.

@agolozar

  • Literature Review on Lung Cancer: She described the process of reviewing literature on PubMed, focusing on abstracts to limit publications for a full-text review. Out of 45 papers reviewed, 40 had relevant codes or information about how the phenotype was developed. She emphasized that none of these papers definitively defined what non-small cell lung cancer or small cell lung cancer is, highlighting the lack of a clear definition in the research community.
  • Changes in Lung Cancer Treatment: Asieh mentioned that checkpoint inhibitors, introduced around 2017, revolutionized the treatment of non-small cell lung cancer. She noted that treatments for lung cancer are constantly evolving, necessitating changes in the definitions and categorizations used in research. For instance, pembrolizumab was used for non-small cell lung cancer patients with certain PD-L1 levels, and osimertinib was used for patients with specific EGFR mutations.
  • Treatment-Based Phenotyping: She discussed the practice of using specific treatments to phenotype lung cancer, such as using etoposide and other drugs in combination with radiotherapy and surgery. She noted that these treatment-based phenotypes need to be continually updated as treatments evolve.
  • Variability in Phenotype Descriptions: Asieh pointed out the variability in phenotype descriptions across different studies. She observed that most studies used registries, specifically SEER Medicare, for data due to the availability of correct codes for identifying lung cancer. However, even within these datasets, there was significant variability in the codes used.

@Gowtham_Rao

  • Data Sources and Lung Cancer Studies: He summarized that out of the reviewed studies, a few were focused on small cell lung cancer (SCLC), and the rest on non-small cell lung cancer (NSCLC). He noted that some studies relied on data sources with histology and pathology, while many did not. Instead, they used generic claims and EHR data sources, which relied on ICD 9 and ICD 10 codes, even when studying complex clinical ideas such as sub-treated groups.
  • Focus on Specific Subtypes in Lung Cancer Research: Rao emphasized that the term “non-small cell lung cancer” is not commonly used as a standalone subject in research. Instead, studies tend to focus on specific subtypes within NSCLC, usually in the context of treatment. This insight highlights the specialization and detailed focus of current lung cancer research.
  • Advanced Stages of NSCLC: He mentioned that NSCLC is categorized into different stages (stage one to four), including substages, while staging is not used in SCLC, which is mostly classified as advanced or not.
  • Phenotype as a Reusable Asset: Rao discussed the concept of the phenotype as a reusable asset, acknowledging that there are multiple opinions in the community regarding this concept. He reflected on the past approach where there was an attempt to establish a gold standard phenotype that could be universally used, which was later deemed ineffective. The current approach involves iterating and improving upon phenotypes, where the phenotype library serves as a starting point that researchers can adapt based on their study design to ensure minimum quality standards.
  • Work Done by the Community: Rao discussed the approach taken by the community in researching NSCLC and small cell lung cancer. He mentioned that a thorough literature search was conducted and the findings were abstracted. Rao observed that none of the papers solely focused on NSCLC or small cell lung cancer in general. Instead, the studies were more specific, dealing with specialized subtypes of these cancers, often defined by factors like the stage of the cancer (advanced stage three or four) or specific genetic mutations.

@Azza_Shoaibi

  • Focus on Subgroups in Lung Cancer Research: Azza noted that applied research in lung cancer, particularly non-small cell lung cancer (NSCLC), is highly focused on subgroups of the disease. This focus is due to the pathological and clinical distinctiveness of NSCLC, which makes it not only identifiable but also relevant to research that aims to understand specifics of subgroups within the broader category of lung cancer.
  • Data Source and Phenotyping in Oncology: She emphasized that the first decision in phenotyping for oncology is determining whether you are dealing with pathology and histology data sources or just administrative and claims data. This distinction is crucial because it affects the availability and specificity of codes used in the data.
  • Color Coding in Data Analysis: Azza mentioned using color coding in data analysis to differentiate data sources with histology and pathology information from those without. This approach helps identify phenotypes that work with registry data and those that work with claims data.
  • Handling Inclusion Criteria and Identifying Consistency: She discussed the importance of handling the inclusion criteria for lung cancer using ICD 10 and 9 codes, and assessing their consistency across studies. She suggested exploring the possibility of identifying meaningful ‘buckets’ or categories within the data that are clinically significant and could be defined by specific sets of codes.
  • Heterogeneity in Code Use and Logic: Azza pointed out the heterogeneity in how different studies use ICD codes and the logic behind them, such as the number of codes required and whether the codes need to be from inpatient or outpatient settings. This heterogeneity poses challenges in creating a consistent approach to phenotyping in lung cancer research.

@Christopher_Mecoli raised a concern about the challenges in oncology due to the very heterogeneous nature of phenotypes, which often encompass dozens of sub-phenotypes. He noted that this field is dynamic, changing every few months, and wondered how this complexity could be managed within a phenotype library. Mecoli suggested the idea of having a generalized parent phenotype with several sub-phenotypes under it, which would be iterated over time.
In response, @Gowtham_Rao acknowledged the validity of this concern, emphasizing the broader concept of the phenotype as a reusable asset. He noted that there are multiple opinions within the community on this issue and that there hasn’t been a consensus reached yet. Rao mentioned that the community previously believed in a gold standard for phenotypes that could be universally applied across all data sources, but this approach was later deemed ineffective. He pointed out that treatments and paradigms in oncology can change, necessitating continuous iteration and improvement of phenotypes. Rao explained that the current purpose of the phenotype library is to serve as a starting point for studies, allowing researchers to pull down and use these phenotypes, with the flexibility to modify them based on their study design and to ensure a minimum quality standard.

@Andy_Kanter raised several points in the discussion:

  • Observational vs. Clinical Trial Data: He questioned the distinction between observational data and data collected specifically for clinical trials, especially concerning missing data in observational studies. Andy highlighted the challenge in balancing specificity and comprehensiveness when data is not consistently captured in routine care.
  • Categorization of Patient Populations: Andy suggested that a potential approach might involve categorizing different patient populations into broader buckets. For example, using histology codes to represent advanced disease or categorizing based on driver mutations or alterations. This approach could help in differentiating patient populations for phenotyping purposes.
  • Utility of Broader Categories: He also raised a question about the practical utility of these broader categories or ‘buckets’. Andy pondered whether these generalized groups would be useful, especially if research is focused on specific subtypes, such as those with certain genetic alterations. He implied that while such categorization might be clinically important, its relevance to specific research needs may vary.

Week 4 - Pulmonary Arterial Hypertension

[10:54 AM] Shoaibi, Azza (Guest)
PAH Data Extraction Step 2
Dear all,

We now have the first set of scanned papers for pulmonary arterial hypertension ready for Step 2: data extraction!
If you want to contribute, please go to Papers for data extraction.xlsx. You can also access it via Files → Phenotype Phebruary 2024 → PAH (Week 3) → 1. Literature Review → 2. STEP 2, open the file called “Papers for data extraction”, and put your name down for as many or as few papers as you would like.


You will need to read the paper and complete 2 forms: the LR abstraction Excel sheet AND the Phenotype Representation doc, also in the same folder. We would like to have the data ready for cohort replication by EOD Friday, Feb 23rd.

Many thanks for participating and looking forward to conversations on what we learn!

Papers for data extraction.xlsx

Meeting today at 10 EST to build MDD cohorts
Here is the link for the call at 10 AM Eastern.