
Phenotype Phebruary 2023 - Week 1 Discussion - Phenotype Peer Review

Thanks @Christian_Reich. The steps you lay out seem reasonable as part of the phenotype development process, and directionally quite aligned with what I tried to introduce on the Phenotype Phebruary kick-off call (video here). Namely, that we should try to get in the habit of developing our phenotypes with evaluation in mind, specifically to have each step aimed at reducing a source of error:

  1. Identify the persons who might have the disease
  • Aim: Increase sensitivity
  • Task: Create inclusive concept sets used in cohort entry events
  2. Exclude persons who likely do not have the disease
  • Aim: Increase specificity / positive predictive value
  • Task: Add inclusion criteria
  3. Determine the start and end dates for each disease episode
  • Aim: Reduce index date misspecification
  • Task: Set exit strategy, refine entry events and inclusion criteria

I would assert that we can and should build better diagnostics for each of these 3 steps:

  1. Diagnostics that can help us identify sensitivity errors by finding concepts we haven’t yet included which would increase our cohort size without substantially changing the composition of the characteristics of the cohort.
  2. Diagnostics that help us identify specificity errors by finding concepts which, if included as inclusion criteria based on a requirement of either their presence or absence, would decrease our sample size and change the composition of the characteristics of the cohort (under the premise that those now excluded are different people than those who remain).
  3. Diagnostics that help us understand the distribution of recurrence and the duration between successive entry events, so that one can determine how to differentiate follow-up care from new ‘incident’ episodes of disease.
  4. Diagnostics that observe the prevalence of symptoms, diagnostic procedures, and treatments in time windows relative to the assigned index date, to determine what revisions to entry events may be warranted to reduce index date misspecification.
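
As a rough, hypothetical illustration of the last kind of diagnostic (prevalence of related events in time windows around the assigned index date), here is a minimal pandas sketch. The table layouts, column names, and concept ids are simplified stand-ins rather than the exact OMOP CDM schema; in practice this is the kind of output tools such as CohortDiagnostics produce.

```python
import pandas as pd

def window_prevalence(cohort, events, concept_ids, windows):
    """Share of cohort members with at least one event from `concept_ids`
    in each (start_day, end_day) window relative to the cohort index date.

    cohort : DataFrame with columns person_id, cohort_start_date
    events : DataFrame with columns person_id, concept_id, event_date
    (Hypothetical, simplified tables - not the exact OMOP CDM layout.)
    """
    merged = events[events["concept_id"].isin(concept_ids)].merge(
        cohort, on="person_id")
    days_from_index = (merged["event_date"] - merged["cohort_start_date"]).dt.days
    n_persons = cohort["person_id"].nunique()
    rows = []
    for start_day, end_day in windows:
        in_window = merged.loc[days_from_index.between(start_day, end_day)]
        rows.append({"window": f"[{start_day}, {end_day}]",
                     "prevalence": in_window["person_id"].nunique() / n_persons})
    return pd.DataFrame(rows)

# Example (hypothetical concept ids for a symptom or procedure of interest):
# a high prevalence in the window *before* index suggests the entry event is
# firing later than the true disease onset, i.e. index date misspecification.
# window_prevalence(cohort, events, {12345}, [(-365, -1), (0, 7), (8, 365)])
```

Comparing such windows for symptoms, diagnostic procedures, and treatments is what lets you judge whether the entry events need revision.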

At the end of that development and evaluation cycle, if you’ve iterated to increase sensitivity, then increase specificity, then decrease index date misspecification, on one or more databases, then you should be left with a final phenotype algorithm that warrants consideration from the community. In a preferred state, the sensitivity, specificity, positive predictive value, and index date misspecification would be explicitly quantified, but, at a minimum, they can be discussed in an evaluation report summarizing the algorithm.

For me, I’m just unsettled on what the incremental value of the peer review process is. If it’s simply to verify that some process has been followed, then we’re just talking about a checklist to ensure the submission has the required elements (without subjective judgement about its contents). But if we imagine peer reviewers have to execute the development/evaluation steps themselves to see if they reach the same conclusion, then I think that’s an unnecessary burden on the reviewer and quite unlikely to yield a successful replication. I think of it akin to peer review in publications: the reviewer is supposed to be assessing the scientific validity of the submission: that the methods are sufficiently described and appropriate to generate the results presented, and that the interpretation of results is put in sufficient context with what is known and what can be inferred from the data. But they are generally not responsible for independently reproducing the results, nor should they try to change the original aim of the study (though they can opine to the journal editors on whether they find the topic of relevance to the field). It seems to me that, in our world of phenotyping, it’s ALWAYS appropriate to aim to develop a phenotype algorithm for a particular clinical target (so we don’t need to question intent or relevancy), so what does that leave for the peer reviewer to do?

The OHDSI Phenotype Development and Evaluation Workgroup has made progress in providing guidance on the clinical description (note: this is still work in progress and in DRAFT stage). The key idea is as follows.

In Phenotype Phebruary 2022 - I introduced the idea of clinical description as the most important step, and now we assert this is a required first step.

The reason for this assertion is: if we don’t spend the time to understand the clinical target being phenotyped, then, in the absence of a clinical understanding of a) what it is, b) what it is not, c) how it starts, d) how it stops, and e) how it is managed, we cannot recognize measurement errors in our

  • entry event (to improve sensitivity and index date timing - by including all relevant events),
  • inclusion rules (to improve specificity - by removing ineligible events),
  • exit criteria (how long does it take to resolve).

Today there is no debate on the importance of stating the target clinical idea's description upfront, but we are still struggling with focusing the content on what the phenotyper needs. A clinical description is like a case definition, but at the population level, and it is required scientific practice.

The clinical description is important for the peer reviewer. As seen in the interactive session on the Appendicitis peer review: because the clinical description stated upfront that the clinical idea of appendicitis included the spectrum of inflammation of the appendix, the peer reviewers agreed that ‘Crohn’s disease of appendix’ was appendicitis. The clinical description justified design choices made by the phenotyper.

It is the clinical description that articulates the target clinical idea we are trying to model with our cohort definition and justifies design choices.

Soon, the Forum software is going to give me that infamous “This subject seems to be really important to you. But consider other people also weighing in”. Will happen soon. :slight_smile:

We hear you. I don’t know either. Maybe it is a euphemism for ensuring that the process is indeed followed and all aspects are thought through? Because few people have the clinical knowledge of all conditions and procedures, AND also know how these facts end up in the data through the data capture system?

I find that example proves the opposite and is disconcerting. Because nominally that assertion is correct. But let’s say you want to study the effectiveness of one antibiotic (curative or preventive around the surgery) compared to another for some bad outcome (perforation or death): you would never include Crohn’s, because that’s just a completely different etiology. However, if you want to study, say, the correlation between inflammatory blood markers and perforation, you might very well include Crohn’s. Which means: the research question determines the clinical description, and there is no phenotype without stating the question per se.

What does that mean for our work?

If we bring these two together, I think we all agree that the sequence is:

Research question → define clinical idea → phenotype development and evaluation.

Concretely, e.g., if this were a clinical trial, we would say the experiment is ‘acute appendicitis among adult persons without Crohn’s disease or chronic appendicitis, randomized to either antibiotic A or antibiotic B’. Then you would create patient eligibility rules for the defined clinical idea ‘acute appendicitis among adult persons without Crohn’s disease or chronic appendicitis’, such as: should be an adult, should have a new diagnosis of appendicitis, should not have a history of chronic appendicitis or Crohn’s disease; these rules would go to the sites that recruit patients. We would also follow the same sequence in observational data, replacing patient eligibility rules with cohort definition logic.

If the above sequence is approximately correct, then the path to phenotype development and evaluation is

’define clinical idea → phenotype development and evaluation’

i.e., there is no logical dependency between phenotype development and the research question.

In the above example, you would specify in the clinical description that ‘persons with appendicitis who have co-occurring Crohn’s disease involving the appendix’ are not eligible.

This clarification in the clinical description would justify a rule in the cohort definition logic to remove persons with Crohn’s disease involving the appendix. It would also make it clear to the peer reviewer why a certain inclusion rule was applied (i.e., exactly 0 persons with Crohn’s disease of the appendix).

Now the evaluation is geared towards the target of ‘acute appendicitis among adult persons without Crohn’s disease or chronic appendicitis’ and not ‘appendicitis’. I.e., finding evidence that there are persons in the data source who have ‘Crohn’s disease of the appendix’ but are not in your cohort would not mean sensitivity errors, while seeing evidence that there are persons in the cohort with ‘Crohn’s disease of the appendix’ would indicate a specificity error.

Compare that to just ‘appendicitis’: there, finding evidence that there are persons in the data source with ‘Crohn’s disease of the appendix’ who are not in your cohort would be a negative hit to sensitivity.

Whether ‘Crohn’s disease of the appendix’ is a good hit (for ‘appendicitis’) or a bad one (for ‘acute appendicitis without Crohn’s disease of the appendix’) is guided by the clinical description, i.e., the target clinical descriptions, which are different for ‘appendicitis’ and ‘acute appendicitis without Crohn’s disease of appendix’.

@Christian_Reich, I think I have a different take on the appendicitis example.

My perspective: there are two completely valid clinical targets that could be unambiguously described in a clinical definition: 1- acute appendicitis without prior chronic inflammation, and 2- inflammation of the appendix (acute or chronic). ‘Crohn’s disease of appendix’ may be included in #2 but would likely be excluded from #1 in the clinical description. But there is no judgement about which of these clinical targets is ‘right’ or ‘wrong’; both are legitimate and can be useful for different research questions.

For each of these clinical targets (with their associated descriptions), it should be possible to develop and evaluate a phenotype algorithm, and it would seem perfectly fine to me that an OHDSI phenotype library would house both target/algorithm pairs. I don’t see any disconnect here.

It’s the responsibility of the researcher to ensure they are picking the appropriate phenotype from the library that suits their research need, but we don’t need to know that researcher’s question before we develop the phenotypes. In my opinion, this example seems more to point to the need to reduce ambiguity in the clinical description (and possibly ensure that our label for the phenotype is sufficiently descriptive so as to avoid confusion for future researchers).

One small addition. If we come up with thresholds for estimated sensitivity and specificity or PPV, I think that mainly addresses misspecification of the variance of the desired clinical estimate where the phenotype will be used. And I expect the thresholds need not be overly narrow (we can give some slack).

More important, and what I am beginning to work on, is differential measurement error that biases the desired clinical estimate. That requires knowing the clinical hypothesis. Right now at least, that is mostly a human review, be it phenotype developer or eventual phenotype user. I can imagine coming up with some heuristic number for estimated bias or sensitivity, but I would have a hard time really trusting it.

So we can consider it a post-phenotyping process, except presumably the better the sens. and spec. or PPV, the less opportunity for differential error.

Wow! This is a REALLY good discussion.
You all give me some good thoughts to improve/clarify the “peer review” step. As I was reading this I found myself agreeing with @Patrick_Ryan @Christian_Reich @Gowtham_Rao and @hripcsa all at the same time!!! Here are my few additional thoughts:

  • In a future state, where we have a clear, systematic, and reproducible process relying on objective measures/thresholds/diagnostics - there will be NO need for a peer to review. As @Patrick_Ryan put it, the review can be a checklist to ensure that the process has been followed.
  • However, until then, I see peer review as one possible, imperfect way to stress test some of the subjectivity in the process and have it questioned by an independent person.
  • Even in this imperfect way, a phenotype peer reviewer (like a publication peer reviewer) does NOT need to redo the study/phenotype. Instead, she can use the submitted material to: 1) assess the clarity of the clinical description and its consistency with the cohort definition; 2) review the interpretation made by the submitter of the data presented (cohort diagnostics, PheValuator, and maybe patient profiles) and either agree or disagree with the submitter’s inferences about the measurement error; 3) finally, either a. accept the phenotype definition as one legitimate entry into the OHDSI phenotype library, or b. reject it based on evidence of substantial/unacceptable measurement error or an incomplete/inaccurate assessment of such errors [here is a great example of a phenotype development and evaluation publication that concludes with a recommendation to NOT use such a phenotype definition in observational studies: Phenotype Algorithms for the Identification and Characterization of Vaccine-Induced Thrombotic Thrombocytopenia in Real World Data: A Multinational Network Cohort Study - PubMed ], or c. accept with modification (this is a great way to recommend changes to the documentation to improve transparency, clarity, and reusability).
  • Now, I have served as a peer reviewer 2 times (Appendicitis this week, and Acute Kidney Injury last year). In both scenarios I found myself redoing some of the “phenotype development” steps to review the submitted phenotypes. While in theory I should not need to do that, I still did it because - like most of you - I trust the phenotype I worked on myself! I am hoping that by the end of this month, we get closer to hardening our process (and maybe documentation) - well enough to make the peer review a simple step to verify that the measurement error is well assessed and is quantitatively small enough to be within the acceptable range for the OHDSI Phenotype Library.

Sure. If in your clinical description you have completed the job of resolving all those potential choices, and made that clear in the text. But that means you can’t run a Phenotype Phebruary and go “who wants to do appendicitis?” There is no such thing as “appendicitis”.

Appendicitis is one example where your phenotype requires iterative workup, with potential splitting. The notorious AMI - ischemic heart disease - unstable angina - stable angina continuum is the same. Or the ischemic - hemorrhagic - ischemic not turning hemorrhagic stroke. Or the acute pancreatitis - cholecystitis - chronic pancreatitis. Or the anaphylaxis - anaphylaxis but not from an insect bite. And so on. These are only the examples I remember from the recent past. I guess every single Phebruary cohort has to be split into several ones - depending on what you want to study.

I think I am saying exactly the same thing. Yes, you can parse all those ambiguities and add them to the clinical description. Except you would multiply each library entry and make many out of them. Maybe that’s what we should do: a hierarchical library.

The clinical description is 1000 times better than what we have now - wishy washy descriptions in the papers. But still they fall short of what is needed: The poor analyst wretch coming to the library has no idea what pitfalls appendicitis can have. And that he has to decide about the Crohn’s. And the chronic appendicitis. And maybe more.

Currently, the clinical description is a yada yada textbook summary of the disease, and how you manage it, and that the patient will be in the ER, and so on. Instead, it should say “Appendicitis, but no chronic condition” or “Acute appendicitis, including an acute eruption of a chronic condition”. Since we are doing this exercise, it would be a huge help to the community to be told what key choices they have to make.


Differential error between what? Different phenotype versions?

Hm. How do you do that, @Azza_Shoaibi? Our opinions are only partially congruent. :slight_smile:

If for the same phenotype, the sensitivity and specificity differ by cohort (eg treatment group), then you can get biased estimates in whatever study you are doing. I guess basically, some aspect of the error would have to be a confounder, correlated with both treatment and outcome. We normally discount the possibility, but that’s what would cause the most trouble.
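
To make that mechanism concrete, here is a minimal numeric sketch (the performance figures are hypothetical, not taken from this thread) of how an outcome phenotype whose sensitivity differs between treatment groups distorts the risk ratio even when the true effect is null:

```python
def observed_risk(true_risk, sensitivity, specificity):
    # Risk as measured through an imperfect outcome phenotype: true cases
    # detected at `sensitivity`, plus false positives among non-cases at
    # rate (1 - specificity).
    return true_risk * sensitivity + (1 - true_risk) * (1 - specificity)

# Hypothetical scenario: the true outcome risk is 10% in both the target (T)
# and comparator (C) cohorts, so the true risk ratio is 1.0.
true_risk = 0.10

# Differential misclassification: the phenotype misses more outcomes in T.
risk_T = observed_risk(true_risk, sensitivity=0.60, specificity=0.95)
risk_C = observed_risk(true_risk, sensitivity=0.99, specificity=0.95)

print(f"observed risk in T = {risk_T:.3f}")            # 0.105
print(f"observed risk in C = {risk_C:.3f}")            # 0.144
print(f"observed risk ratio = {risk_T / risk_C:.2f}")  # ~0.73 despite a true RR of 1.0
```

With nondifferential error (the same sensitivity and specificity in both cohorts), the observed ratio would generally be biased toward the null, but it would not manufacture a spurious protective effect the way the differential case can.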

@hripcsa are you referring to selection bias? i.e., If T has a sensitivity of 60 and PPV of 40, and C has a sensitivity of 99 and PPV of 99…

Then there are persons in T who don’t have the disease/treatment but the cohort definition algorithm classifies them as having it, i.e., they may have something else that we don’t want, but we think they have something we want. I.e., measurement error.

In such scenarios, the measurement errors may be so bad that we can’t achieve an exchangeability state (like randomization) with, for example, our propensity score methods. In the absence of such a state, there will be residual confounding causing a lot of trouble.

Right, @Gowtham_Rao .

The question is what is the question. Atrial fibrillation as an adverse event may require a different phenotype from atrial fibrillation requiring DOAC therapy, thus necessitating additional clinical descriptions and additional reviews and diagnostic metrics.

We always need to start with the research question, which may alter the clinical idea, and we may discover that our data are not fit-for-purpose for a given research question. If I don’t believe you are diabetic until I see a HOMA-IR, or that you are depressed until I see a HAM-D, then our data will not be able to define the clinical idea needed to address a highly specific research question.

On the topic of assessing differential misclassification: with rare exposures (T) and rare outcomes, how do we achieve the precision necessary to conclude that 60 and 99 are different? Those are extreme differences, but 70 and 80 are also different - yet only if you have enough data to conclude that they are.
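
As a back-of-the-envelope sketch of that precision problem (the validation sample sizes below are hypothetical), a simple Wald confidence interval shows that with only a few dozen adjudicated cases per cohort, estimated sensitivities of 0.70 and 0.80 are statistically indistinguishable:

```python
import math

def wald_ci(p_hat, n, z=1.96):
    # Normal-approximation (Wald) 95% CI for a proportion, e.g. a sensitivity
    # estimated from n chart-reviewed true cases.
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return (max(0.0, p_hat - half_width), min(1.0, p_hat + half_width))

# With rare exposures/outcomes you may only be able to validate a handful of cases.
for n in (30, 100, 1000):
    lo70, hi70 = wald_ci(0.70, n)
    lo80, hi80 = wald_ci(0.80, n)
    overlap = hi70 >= lo80  # overlapping CIs: a rough screen, not a formal test
    print(f"n={n:4d}  sens 0.70 CI=({lo70:.2f}, {hi70:.2f})  "
          f"sens 0.80 CI=({lo80:.2f}, {hi80:.2f})  overlap={overlap}")
```

In this sketch, on the order of 1,000 validated cases per group are needed before the two intervals separate, which is rarely feasible for a rare outcome.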

Agreed. But given a clinical idea/description, can we now phenotype blinded to the research question?

I think this is one big fat lingering open debate question. Other people have brought it up: @agolozar, @Daniel_Prieto, etc. If something is a key input, how can you “blind” the process from it?

Or do you mean “abstract it from a concrete question”? “Make it generic to typical types of questions”? Introduce “flavors” in @Daniel_Prieto’s words?

If so, we have to say what these flavors or types of research questions are, and state them in the description.

Many times you all are saying the same things but somehow focusing on different points of the same argument.

Summary of the Week 1 discussion on peer review
(all paraphrased)

  1. In the presence of objective diagnostics with prespecified decision thresholds, we don’t need peer review - only a checklist that a process was followed. However, today we have neither objective diagnostics nor such a checklist. @Patrick_Ryan
  2. Peer review is not pitting one person’s subjective opinion against another person’s work with little-to-no empirical basis to reconcile differences. @Patrick_Ryan
  3. Peer reviewer role - options for the peer reviewer could be a) accept, b) reject, c) revise and resubmit.
  4. Use of peer review may not have any advantage beyond providing superficial confidence. @Patrick_Ryan
  5. Future researcher “to determine if the existing phenotype is fit-for-purpose for the new research question or different database.” @Kevin_Haynes
  6. In the absence of objective diagnostics, "By soliciting another peer scientist to provide their independent perspective - we hope to discover measurement errors that we would not have otherwise seen " @Gowtham_Rao
  7. Checks whether a) the clinical idea is described and unambiguous, b) the cohort definition logic and clinical description are concordant, c) the cohort definitions are tested. @Gowtham_Rao
  8. We apply the OHDSI principles of a systematic approach. If we do undefined “reviews”, we add more pain to the current situation. @Christian_Reich
  9. The job of the peer is to ensure the process pressure-tests the criteria to correct misclassification: index criterion, sensitivity problems, specificity problems.
  10. We should develop phenotypes with evaluation in mind - i.e., each step should reduce a source of error. Proposes a diagnostics-informed approach for the process. @Patrick_Ryan
  11. Peer reviewers repeating the steps to see if they come to the same conclusion - probably unnecessary. Uses analogy of the publication peer reviewer @Patrick_Ryan
  12. In phenotyping, we don’t need to question the intent or relevancy of phenotyping a clinical target. @Patrick_Ryan
  13. It is the clinical description that helps justify design choices in cohort definition @Gowtham_Rao
  14. The peer reviewer should not pass judgment on whether the described clinical target (as in the clinical description) is right or wrong, e.g., Crohn’s disease of appendix. @Patrick_Ryan
  15. Estimating differential error may be a post-phenotyping process. @hripcsa
  16. Until objective diagnostics with decision thresholds are in place, peer review may be one possible imperfect way to stress test some of the subjectivity in the process and have it questioned by an independent person. @Azza_Shoaibi
  17. The phenotype peer reviewer does not need to redo the phenotype development, but should be able to come to a recommendation, a) accept or b) reject, based on the submitted material. @Azza_Shoaibi
  18. If we parse all ambiguities and add them to the clinical description, we would multiply each library entry. @Christian_Reich
  19. The clinical description is 1000 times better than what we have now - wishy-washy descriptions in the papers. But they still fall short of what is needed. @Christian_Reich
  20. The question is what is the question. Atrial fibrillation as an adverse event may require a different phenotype from Atrial fibrillation requiring DOAC therapy. Thus necessitating additional clinical descriptions and additional reviews and diagnostic metrics. @Kevin_Haynes

Coming in as someone who is not a phenotyper, but rather a person often relying on their work: if the goal of OHDSI is generating evidence, then phenotypes should be seen as a means to this end.

My impression though is that the general discussion around phenotypes often treats them as an end in themselves. I would have thought that, alongside decisions around choice of study design and appropriate statistics etc, decisions around phenotypes surely must also be context-dependent. Without knowing the context for which a phenotype will be used, won’t the response of the reviewer often simply be “it depends”?

Going back to the original question of what criteria are important for reviewing a phenotype, I would think that two important elements that need to be considered at the stage of reviewing a new phenotype or the repurposing of a previously used one are:

  1. the research question for which a phenotype is going to be used, and
  2. the databases which will be included in the study to address that research question.

(Maybe this latter is worth an entirely separate discussion, especially now that the success of OHDSI has brought with it an increasing variety of types of data sources!)

Absolutely correct! Let’s talk about this topic in week 3 of Phenotype Phebruary 2023!
