
Phenotype Phebruary 2023 - Week 1 Discussion - Phenotype Peer Review

Today and for this week, we want to start a discussion about Peer Review. In particular, we want to start answering the following questions:

  • What criteria should a peer reviewer consider while reviewing a phenotype?
  • Does the fact that an independent person did a review increase trust in, or the usability of, the cohort definition?

On top of checking the accuracy of the clinical exposé, aren’t the criteria sensitivity, specificity, timeliness, and correct exit? Isn’t that what we need from a phenotype?

Thanks @Azza_Shoaibi for kicking off this Phenotype Phebruary 2023 week 1 discussion topic. And thanks @Christian_Reich for chiming in with your perspective. I really hope we’ll hear from others on this important topic.

I really appreciate the careful thought and consideration that @Evan_Minty and @Andrea_Noel put into their peer reviews on pancreatitis and anaphylaxis, respectively, as well as the great discussion that took place on the interactive peer review of appendicitis.

That said, I’m still undecided myself about what (if any) role a peer review should play in the phenotype development and evaluation process.

My current thoughts (which I’ll try to share here provocatively, in the hope of stimulating some debate with others in the community who may have differing opinions):

A phenotype algorithm (aka cohort definition) is a specification for how we identify a cohort, which is a set of persons who satisfy one or more inclusion criteria for a duration of time. For simplicity in this discussion, I’ll focus on the task of phenotyping a disease.

For a given disease (which needs to be precisely defined in a clinical description), the task of the phenotype algorithm is simply to use the data available to identify which persons have the disease and when.

We can try to measure the extent to which a phenotype algorithm accomplishes this temporal classification task in a given database. There are three primary types of measurement error that we want to be on the lookout for:

  1. sensitivity error - the phenotype algorithm has failed to identify persons who truly have the disease
  2. specificity error - the phenotype algorithm has failed to identify persons who truly don’t have the disease (incorrectly classified a person as having the disease when they do not)
  3. index date misspecification - the phenotype algorithm incorrectly assigns the start or end date of a person’s disease episode

Given a database estimate of sensitivity/specificity/positive predictive value and a quantification of index date misspecification (e.g., what % of persons have an incorrect date? what is the expected duration between the real and assigned dates?), one should be able to evaluate whether the phenotype algorithm is adequate for use in that database. The test is whether calibration for the measurement error on the evidence statistic of interest (such as the incidence rate in a characterization study, the relative risk in a population-level effect estimation analysis, or the AUC and calibration in a prediction study) is sufficiently small that it will not substantially impact the interpretation of the evidence. More ideally, this would be an objective diagnostic with pre-specified decision thresholds, such that phenotype algorithm adequacy could be determined explicitly from the measurement error estimates. For example, a decision threshold could be something like: accept the phenotype algorithm if ABS( LN( positive predictive value / sensitivity ) ) < 0.69; else reject it. This decision threshold ensures that calibration of the incidence rate of the phenotype as an outcome in a characterization study doesn’t result in more than a 2x change in the observed incidence in either direction.
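As a minimal sketch of how such a pre-specified decision threshold could be applied (the function name and the example estimates are hypothetical, not an existing OHDSI diagnostic):

```python
import math

def accept_phenotype(ppv: float, sensitivity: float, max_log_ratio: float = 0.69) -> bool:
    """Hypothetical decision rule: accept the phenotype algorithm when
    abs(ln(PPV / sensitivity)) is below the threshold. A simple calibration
    of an observed incidence rate multiplies it by roughly PPV / sensitivity,
    so a threshold of ln(2) ~= 0.69 bounds the change to no more than 2x
    in either direction."""
    return abs(math.log(ppv / sensitivity)) < max_log_ratio

# Made-up measurement error estimates from a single database:
print(accept_phenotype(ppv=0.75, sensitivity=0.50))  # ln(1.5) ~= 0.41 -> True (accept)
print(accept_phenotype(ppv=0.90, sensitivity=0.30))  # ln(3.0) ~= 1.10 -> False (reject)
```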

If we could agree on objective diagnostics and pre-specified decision thresholds to determine phenotype algorithm adequacy based on the measurement error estimates for a given database (and it’s still a big ‘if’ that requires more methodological research like what @jweave17 is doing with @Daniel_Prieto at Oxford), then what is the role of peer review?

In my opinion, the job of the peer reviewer is NOT to question the choices within the algorithm, to review the codelist, to speculate on ways the algorithm could be tweaked, or to project whether the algorithm should be used by other researchers in other databases: all of these tasks simply pit one person’s subjective opinions against another’s, with little-to-no empirical basis to reconcile differences. Instead, I believe the role of the peer reviewer COULD BE two-fold: 1) to review the clinical description and ensure it is not ambiguous about the intended target of the phenotype algorithm, and 2) to verify that the phenotype algorithm measurement error estimates are appropriate measures aligned with the clinical description. Akin to peer review at a journal, the options of a peer reviewer would be to either a) accept the phenotype algorithm into a library as a valid algorithm worthy of consideration for re-use by future researchers, b) reject the phenotype algorithm because the measurement error estimates are not valid, or c) request a ‘revise-and-resubmit’ to improve the clinical description or the accompanying evidence about measurement error.

Regardless of whether a peer review takes place, it seems clear to me that the role of the researcher who is conducting a new study and needs to decide whether she will re-use an existing phenotype algorithm is to: 1- determine if the clinical description of the algorithm aligns with her research question, and 2- determine if the measurement error in her data network is sufficiently small to move forward. For the researcher to accomplish #2, she either needs to 1- re-use the same data that the measurement error estimates were originally generated on, 2- produce new measurement error estimates on her own data, or 3- make a generalization assumption that the error estimates observed by the original developer are likely to be consistent with what would be observed in her new data (without empirically evaluating that assumption). Given this responsibility of the researcher, it’s unclear to me how having a peer review completed ahead of time helps, aside from potentially giving some superficial confidence in considering the phenotype algorithm in the first place.

For phenotypes that truly aim to capture the first occurrence of a condition within an individual’s lifetime, our resources will need better record linkage to advance the ability to computationally characterize a person’s person-time. Some individuals in a database have eight-plus years of follow-up time; others have just a single trip around the sun worth of data collection.

So my clinical review of a clinical description for {insert new onset chronic disease here} will likely focus on the ambiguity between new to the data and truly new to the patient, based on available database follow-up time. Perhaps we should also consider crafting limitations sections for our phenotypes, similar to my peer review of journal articles, when I get to the limitations section and read “we studied 30d hospital readmission but we only know if they came back to our hospital.” We could have some authored guardrails limiting future researchers from over-extending the phenotype’s capabilities. That said, I totally agree that it is the role of the future researcher to determine if the existing phenotype is fit-for-purpose for the new research question or different database.

So that said, do we need the peer review process, or does the pre-print model that this Phorum (it’ll catch on! we pharmacists love spelling things with Ph, it’s acidic) so provides give our community the open-source dialogue we need on our algorithms?

In the presence of objective criteria, the role of the peer reviewer would be limited or not needed. Today we do not have objective criteria, or such methods are not feasible (e.g., chart reviews).

The challenge we are trying to help solve is trust in our phenotypes (and so trust in our evidence): what would make the work of one scientist trustworthy to others? In the absence of objective measures of such trust, we need subjective judgments. By soliciting another peer scientist to provide their independent perspective, we hope to discover measurement errors that we would not have otherwise seen (or did not care to see).

At a minimum, a peer reviewer may look for:

  1. Is the clinical idea described and unambiguous?
  2. Are the cohort definition logic and the stated clinical idea concordant? E.g., if the clinical idea is anaphylaxis but ‘insect bite anaphylaxis’ is excluded without any rationale justifying it, that would be evidence of discordance, i.e., if insect bites may cause anaphylaxis, why was it excluded?
  3. Have the cohort definitions been tested on at least one data source, and has the output been assessed in terms of measurement errors (sensitivity, specificity, index date misclassification)?

In our field, no cohort definition is perfect, and there will always be measurement errors. If we go ahead and perform research without being aware of such errors, our work may suffer from unknown biases, i.e., be less trustworthy.

What I think the peer scientist is doing is serving as an independent check to see whether the original contributing scientist knows the (potential) measurement errors.

That is how we operationalize the phenotype. The criteria are the means for that. The purpose of a cohort is to create a set of persons that satisfy the clinical description.

Now, what is a clinical description? It is not a textbook narrative of the ideal patient, like what we are producing a lot of here. Instead, it is a biological state of health of the organism of the patient. How do we know that? We take the word of the clinician in the data, pressure test it, and recruit additional data to help with the misclassifications (see below).

Exactly right. But what we do have is clinical knowledge to make a good guess about misclassifications and the criteria we invent to correct for them.

Which is why we need to operationalize, or engineer this process.

Step / Job of the peer (to pressure test):

1. Define the index criterion: Is the concept set checked for sensitivity (forgot something) or specificity (too broad) errors?
2. Decide if sensitivity is a problem: Are all reasons considered: (i) disease not reimbursable, (ii) disease common and not severe, (iii) disease hard to diagnose, (iv) disease often confused with differential dx, (v) codes lousy?
3. Decide if specificity is a problem: Are all reasons considered: (i) disease likely to be upcoded, (ii) diagnostic workup reimbursable, (iii) disease hard to diagnose, (iv) disease often confused with differential dx, (v) codes lousy?
4. Decide if onset timing is a problem: Are all reasons considered: (i) disease builds slowly, (ii) disease sensitivity low and therefore diagnosed late, (iii) disease not reimbursable?
5. Decide if resolution timing is a problem: Are all reasons considered: (i) disease can progress differently, (ii) only way to tell is if patients stop showing up?
6. Check out criteria fixing 2.-5.: Are cohort comparisons evaluated: with the index criterion alone vs. with criteria 2.-5. added, one at a time? (See the sketch below.)
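As a minimal sketch of step 6, one could compare the cohort produced by the index criterion alone against the cohorts produced when each corrective criterion is added, one at a time (the person ids and criterion names below are entirely hypothetical):

```python
# Sketch of step 6 (hypothetical person ids and criterion names): compare the
# cohort built from the index criterion alone against the cohort obtained when
# each corrective criterion from steps 2-5 is added, one at a time.
index_cohort = {1, 2, 3, 4, 5, 6, 7, 8}

criterion_cohorts = {
    "require confirmatory diagnostic workup": {1, 2, 3, 4, 5},
    "exclude differential diagnosis codes":   {1, 2, 3, 4, 5, 6},
    "require treatment within 30 days":       {2, 3, 4, 5, 6, 7},
}

for name, cohort in criterion_cohorts.items():
    kept = index_cohort & cohort
    removed = index_cohort - cohort
    print(f"{name}: keeps {len(kept)} of {len(index_cohort)} persons, "
          f"removes {len(removed)} for characterization/manual review")
```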

We only have a chance if we apply the OHDSI principles of a systematic approach. If we do undefined “reviews” we add more pain to the current situation.


Thanks @Christian_Reich. The steps you lay out seem reasonable as part of the phenotype development process, and directionally quite aligned with what I tried to introduce on the Phenotype Phebruary kick-off call (video here). Namely, that we should try to get in the habit of developing our phenotypes with evaluation in mind, specifically to have each step aimed at reducing a source of error:

  1. Identify the persons who might have the disease
  • Aim: Increase sensitivity
  • Task: Create inclusive concept sets used in cohort entry events
  2. Exclude persons who likely do not have the disease
  • Aim: Increase specificity / positive predictive value
  • Task: Add inclusion criteria
  3. Determine the start and end dates for each disease episode
  • Aim: Reduce index date misspecification
  • Task: Set exit strategy, refine entry events and inclusion criteria

I would assert that we can and should build better diagnostics for each of these 3 steps: 1- diagnostics that can help us identify sensitivity errors by finding concepts we haven’t yet included which would increase our cohort size without substantially changing the composition of the characteristics of the cohort; 2- diagnostics that help us identify specificity errors by finding concepts which, if included as inclusion criteria based on a requirement of either their presence or absence, would decrease our sample size and would change the composition of the characteristics of the cohort (under the premise that those now excluded are different people than those who remain); 3- diagnostics that help us understand the distribution of recurrence and the duration between successive entry events, so that one can determine how to differentiate follow-up care from new ‘incident’ episodes of disease; and 4- diagnostics that observe the prevalence of symptoms, diagnostic procedures, and treatments in time windows relative to the assigned index date to determine what revisions to entry events may be warranted to reduce index date misspecification (a minimal sketch of this last diagnostic follows).
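To make that fourth diagnostic concrete, here is a minimal sketch (hypothetical events table and concept names, not the actual OHDSI cohort diagnostics tooling) that tabulates how often marker concepts appear in time windows relative to the assigned index date:

```python
import pandas as pd

# Hypothetical long-format table of events for cohort members, with days
# relative to the assigned index date (negative = before index).
events = pd.DataFrame({
    "person_id": [1, 1, 2, 2, 3, 3, 4],
    "concept": ["abdominal pain", "appendectomy", "abdominal pain",
                "appendectomy", "abdominal pain", "appendectomy", "appendectomy"],
    "days_from_index": [-2, 1, -35, 0, -1, 2, -20],
})

windows = {"-30d to -1d": (-30, -1), "0d to +7d": (0, 7)}
n_persons = events["person_id"].nunique()

for label, (lo, hi) in windows.items():
    in_window = events[events["days_from_index"].between(lo, hi)]
    prevalence = in_window.groupby("concept")["person_id"].nunique() / n_persons
    print(f"\nWindow {label}:\n{prevalence}")
```

A treatment or procedure concept that frequently appears well before the assigned index date (here, person 4’s appendectomy at day -20) would suggest the entry event is anchored too late for some persons.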

At the end of that development and evaluation cycle, if you’ve iterated to increase sensitivity, then increase specificity, then decrease index date misspecification, on one or more databases, you should be left with a final phenotype algorithm that warrants consideration from the community. In a preferred state, the sensitivity, specificity, positive predictive value, and index date misspecification would be explicitly quantified, but at a minimum, they can be discussed in an evaluation report summarizing the algorithm.

For me, I’m just unsettled on what the incremental value of the peer review process is. If it’s simply to verify that some process has been followed, then we’re just talking about a checklist to ensure the submission has the required elements (without subjective judgement about its contents). But if we imagine peer reviewers have to execute the development/evaluation steps themselves to see if they reach the same conclusion, then I think that’s an unnecessary burden on the reviewer and quite unlikely to yield a successful replication. I think of it akin to peer review of publications: the reviewer is supposed to assess the scientific validity of the submission, that the methods are sufficiently described and appropriate to generate the results presented, and that the interpretation of results is put in sufficient context with what is known and what can be inferred from the data. But reviewers are generally not responsible for independently reproducing the results, nor should they try to change the original aim of the study (though they can opine to the journal editors on whether they find the topic of relevance to the field). It seems to me that, in our world of phenotyping, it’s ALWAYS appropriate to aim to develop a phenotype algorithm for a particular clinical target (so we don’t need to question intent or relevancy), so what does that leave for the peer reviewer to do?

The OHDSI Phenotype Development and Evaluation Workgroup has made progress in providing guidance on the clinical description (note this is still work in progress and in DRAFT stage); however, the key idea is this:

In Phenotype Phebruary 2022, I introduced the idea of the clinical description as the most important step, and now we assert it is a required first step.

The reason for this assertion: if we don’t spend the time to understand the clinical target being phenotyped, then in the absence of a clinical understanding of a) what it is, b) what it is not, c) how it starts, d) how it stops, and e) how it is managed, we cannot recognize measurement errors in our

  • entry event (to improve sensitivity and index date timing - by including all relevant events),
  • inclusion rules (to improve specificity - by removing ineligible events),
  • exit criteria (how long does it take to resolve).

Today there is no debate about the importance of stating the target clinical idea’s description upfront, but we are still struggling with focusing the content on what the phenotyper needs. A clinical description is like a case definition, but at the population level, and it is required scientific practice.

The clinical description is important for the peer reviewer. As seen in the interactive appendicitis peer review session: because the clinical description stated upfront that the clinical idea of appendicitis included the full spectrum of inflammation of the appendix, the peer reviewers agreed that ‘Crohn’s disease of the appendix’ was appendicitis. The clinical description justified the design choices made by the phenotyper.

It is the clinical description that articulates the target clinical idea we are trying to model with our cohort definition and justifies design choices.

Soon, the Forum software is going to give me that infamous “This subject seems to be really important to you. But consider other people also weighing in”. :slight_smile:

We hear you. I don’t know either. Maybe it is a euphemism for ensuring that the process is indeed followed and all aspects are thought through? Because few people have the clinical knowledge of all conditions and procedures AND also know how these facts end up in the data through the data capture system.

I find that example proves the opposite and is disconcerting. Because nominally that assertion is correct. But let’s say you want to study the effectiveness of one antibiotic (curative or preventative for the surgery) compared to another for some bad outcome (perforation or death) - you would never include Crohn’s, because that’s just a completely different etiology. However, if you want to study, say, the correlation between inflammatory blood markers and perforation you might include Crohn’s very well. Which means: The research question determines the clinical description, and there is no phenotype without stating the question per se.

What does that mean for our work?

If we bring these two together, I think we all agree that the sequence is:

Research question → define clinical idea → phenotype development and evaluation.

Concretely, if this were a clinical trial, we would say the experiment is ‘acute appendicitis among adult persons without Crohn’s disease or chronic appendicitis, randomized to either antibiotic A or antibiotic B’. Then you would create patient eligibility rules for the defined clinical idea ‘acute appendicitis among adult persons without Crohn’s disease or chronic appendicitis’, such as: must be an adult, must have a new diagnosis of appendicitis, must not have a history of chronic appendicitis or Crohn’s disease. These rules would go to the sites that recruit patients. We would follow the same sequence in observational data, replacing patient eligibility rules with cohort definition logic (a minimal sketch below).
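A minimal sketch of how those eligibility rules could be translated into cohort definition logic against observational data (the dataframe, column names, and flags are illustrative, not OMOP CDM fields):

```python
import pandas as pd

# Hypothetical per-person flags derived from the source data.
patients = pd.DataFrame({
    "person_id":                    [1, 2, 3, 4],
    "age_at_diagnosis":             [34, 15, 52, 61],
    "new_appendicitis_diagnosis":   [True, True, True, False],
    "history_chronic_appendicitis": [False, False, True, False],
    "history_crohns_disease":       [False, False, False, True],
})

# 'Acute appendicitis among adult persons without Crohn's disease or
# chronic appendicitis', expressed as cohort definition logic:
cohort = patients[
    (patients["age_at_diagnosis"] >= 18)
    & patients["new_appendicitis_diagnosis"]
    & ~patients["history_chronic_appendicitis"]
    & ~patients["history_crohns_disease"]
]
print(cohort["person_id"].tolist())  # -> [1]
```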

If the above sequence is approximately correct, then the path to phenotype development and evaluation is

’define clinical idea → phenotype development and evaluation’

i.e., there is thus no direct logical dependency between phenotype development and the research question.

In the above example, you would specify in the clinical description that ‘persons with appendicitis who have co-occurring Crohn’s disease involving the appendix’ are not eligible.

This clarification in the clinical description would justify a rule in the cohort definition logic to remove persons with Crohn’s disease involving the appendix. It would also make it clear to the peer reviewer why a certain inclusion rule was applied (i.e., the expectation of exactly 0 persons with Crohn’s disease of the appendix in the cohort).

Now the evaluation is geared towards the target of ‘acute appendicitis among adult persons without Crohn’s disease or chronic appendicitis’ and not ‘appendicitis’. That is, finding evidence that there are persons in the data source who have ‘Crohn’s disease of the appendix’ but who are not in your cohort would not indicate a sensitivity error, while seeing evidence that there are persons in the cohort with ‘Crohn’s disease of the appendix’ would indicate a specificity error.

Compare that to just ‘appendicitis’: there, evidence that there are persons in the data source with ‘Crohn’s disease of the appendix’ who are not in your cohort would count against sensitivity.

Whether ‘Crohn’s disease of the appendix’ is good (for ‘appendicitis’) or bad (for ‘acute appendicitis without Crohn’s disease of the appendix’) is guided by the clinical description, i.e., the target clinical descriptions, which are different for the two.

@Christian_Reich , I think I have a different take on the appendicitis example.

My perspective: there are two completely valid clinical targets that could be unambiguously described in a clinical definition: 1- acute appendicitis without prior chronic inflammation, and 2- inflammation of the appendix (acute or chronic). ‘Crohn’s disease of the appendix’ may be included in #2 but would likely be excluded from #1 in the clinical description. But there is no judgement about which of these clinical targets is ‘right’ or ‘wrong’; both are legitimate and can be useful for different research questions.

For each of these clinical targets (with their associated descriptions), it should be possible to develop and evaluate a phenotype algorithm, and it would seem perfectly fine to me that an OHDSI phenotype library would house both target/algorithm pairs. I don’t see any disconnect here.

It’s the responsibility of the researcher to ensure they are picking the appropriate phenotype from the library that suits their research need, but we don’t need to know that researcher’s question before we develop the phenotypes. In my opinion, this example seems more to point to the need to reduce ambiguity in the clinical description (and possibly to ensure that our label for the phenotype is sufficiently descriptive so as to avoid confusion for future researchers).

One small addition. If we come up with thresholds for estimated sensitivity and specificity or PPV, I think that mainly addresses misspecification of the variance of the desired clinical estimate where the phenotype will be used. And I expect the thresholds need not be overly narrow (we can give some slack).

More important, and what I am beginning to work on, is differential measurement error that biases the desired clinical estimate. That requires knowing the clinical hypothesis. Right now at least, that is mostly a human review, be it phenotype developer or eventual phenotype user. I can imagine coming up with some heuristic number for estimated bias or sensitivity, but I would have a hard time really trusting it.

So we can consider it a post-phenotyping process, except presumably the better the sens. and spec. or PPV, the less opportunity for differential error.

Wow! This is a REALLY good discussion.
You all give me some good thoughts to improve/clarify the “peer review” step. As I was reading this I found myself agreeing with @Patrick_Ryan @Christian_Reich @Gowtham_Rao and @hripcsa all at the same time!!! Here are my few additional thoughts:

  • In a future state, where we have a clear, systematic, and reproducible process relying on objective measures/thresholds/diagnostics, there will be NO need for a peer review. As @Patrick_Ryan put it, the review can be a checklist to ensure that the process has been followed.
  • However, until then, I see the peer review as one possible (imperfect) way to stress test some of the subjectivity in the process and have it questioned by an independent person.
  • Even in this imperfect form, a phenotype peer reviewer (like a publication peer reviewer) is NOT needed to redo the study/phenotype. Instead, she can use the submitted material to 1) assess the clarity of the clinical description and its consistency with the cohort definition; 2) review the interpretation the submitter made of the data presented (cohort diagnostics, PheValuator, and maybe patient profiles) and either agree or disagree with the submitter’s inferences about the measurement error; 3) finally, either a. accept the phenotype definition as one legitimate entry into the OHDSI phenotype library, b. reject it based on evidence of substantial/unacceptable measurement error or an incomplete/inaccurate assessment of such errors [here is a great example of a phenotype development and evaluation publication that concludes with a recommendation NOT to use such a phenotype definition in observational studies: Phenotype Algorithms for the Identification and Characterization of Vaccine-Induced Thrombotic Thrombocytopenia in Real World Data: A Multinational Network Cohort Study - PubMed ], or c. accept with modification (this is a great way to recommend changes to the documentation to improve transparency, clarity, and reusability).
  • Now, I have served as a peer reviewer 2 times (Appendicitis this week, and Acute Kidney Injury last year). In both scenarios I found myself redoing some of the “phenotype development” steps to review the submitted phenotypes. While in theory I should not need to do that, I still did it because, like most of you, I trust the phenotype I worked on! I am hoping that by the end of this month we get closer to hardening our process (and maybe documentation) well enough to make the peer review a simple step to verify that the measurement error is well assessed and quantitatively small enough to be within an acceptable range for the OHDSI Phenotype Library.

Sure. If in your clinical description you have done the job of resolving all those potential choices, and made that clear in the text. But that means you can’t run a Phenotype Phebruary and go “who wants to do appendicitis?” There is no such thing as “appendicitis”.

Appendicitis is one example where your phenotype requires iterative workup, with potential splitting. The notorious AMI - ischemic heart disease - unstable angina - stable angina continuum is the same. Or the ischemic - hemorrhagic - ischemic not turning hemorrhagic stroke. Or the acute pancreatitis - cholecystitis - chronic pancreatitis. Or the anaphylaxis - anaphylaxis but not from an insect bite. And so on. These are only the examples I remember from the recent past. I guess every single Phebruary cohort has to be split into several ones - depending on what you want to study.

I think I am saying exactly the same thing. Yes, you can parse all those ambiguities and add them to the clinical description. Except you would multiply each library entry and make many out of them. Maybe that’s what we should do: a hierarchical library.

The clinical description is 1000 times better than what we have now: wishy-washy descriptions in the papers. But it still falls short of what is needed: the poor wretch of an analyst coming to the library has no idea what pitfalls appendicitis can have. And that he has to decide about the Crohn’s. And the chronic appendicitis. And maybe more.

Currently, the clinical description is a yada yada textbook summary of the disease, and how you manage it, and that the patient will be in the ER, and so on. Instead, it should say “Appendicitis, but no chronic condition” or “Acute appendicitis, including an acute eruption of a chronic condition”. Since we are doing this exercise, it would be a huge help to the community to be told what key choices they have to make.


Differential error between what? Different phenotype versions?

Hm. How do you do that, @Azza_Shoaibi? Our opinions are only partially congruent. :slight_smile:

If, for the same phenotype, the sensitivity and specificity differ by cohort (e.g., by treatment group), then you can get biased estimates in whatever study you are doing. I guess basically, some aspect of the error would have to be a confounder, correlated with both treatment and outcome. We normally discount the possibility, but that’s what would cause the most trouble.
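To illustrate why the differential case is the troublesome one, here is a minimal numeric sketch (all numbers invented; simplified to outcome phenotype misclassification with perfect specificity):

```python
# Illustrative (made-up) numbers: the true outcome risk is identical in the
# treated (T) and comparator (C) groups, so the true risk ratio is 1.0.
true_risk_T = 0.10
true_risk_C = 0.10

def observed_risk(true_risk, sensitivity, specificity=1.0):
    # Observed risk under outcome misclassification: detected true positives
    # plus false positives among the truly non-diseased.
    return true_risk * sensitivity + (1 - true_risk) * (1 - specificity)

# Nondifferential error: same phenotype sensitivity in both groups.
rr_nondiff = observed_risk(true_risk_T, 0.6) / observed_risk(true_risk_C, 0.6)

# Differential error: the phenotype misses more outcomes in T than in C.
rr_diff = observed_risk(true_risk_T, 0.6) / observed_risk(true_risk_C, 0.9)

print(round(rr_nondiff, 2))  # 1.0  -> risk ratio unbiased
print(round(rr_diff, 2))     # 0.67 -> spurious 'protective' effect
```

With nondifferential sensitivity and perfect specificity the risk ratio is untouched; once the sensitivity differs between T and C, a spurious effect appears.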

@hripcsa are you referring to selection bias? I.e., if T has a sensitivity of 60 and PPV of 40, and C has a sensitivity of 99 and PPV of 99?

Then there are persons in T who don’t have the disease/treatment but whom the cohort definition algorithm classifies as having it, i.e., they may have something we don’t want but we think they have something we want, i.e., measurement error.

In such scenarios, the measurement errors may be so bad that we can’t achieve a state of exchangeability (like randomisation) with, for example, our propensity score methods. In the absence of such a state, there will be residual confounding causing a lot of trouble.

Right, @Gowtham_Rao .

The question is: what is the question? Atrial fibrillation as an adverse event may require a different phenotype from atrial fibrillation requiring DOAC therapy, thus necessitating additional clinical descriptions and additional reviews and diagnostic metrics.

We always need to start with the research question, which may alter the clinical idea, and we may discover that our data are not fit-for-purpose for a given research question. If I don’t believe you are diabetic until I see a HOMA-IR, or that you’re depressed until I see a HAM-D, then our data will not be able to define the clinical idea needed to address a highly specific research question.

On the topic of assessing differential misclassification: with rare exposures (T) and rare outcomes, how do we achieve the precision necessary to conclude that 60 and 99 are different? These are extreme differences, but 70 and 80 are also different, and only if you have enough data can you conclude that they are (a rough sketch below).
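As a rough sketch of that precision question (a simple normal-approximation interval with invented validation sample sizes, not a recommended method):

```python
import math

def sensitivity_ci(estimate, n_true_cases, z=1.96):
    """Normal-approximation 95% CI for an estimated sensitivity,
    given how many true cases were available for validation."""
    se = math.sqrt(estimate * (1 - estimate) / n_true_cases)
    return max(0.0, estimate - z * se), min(1.0, estimate + z * se)

# With few validated cases (rare outcome), 0.70 vs 0.80 are not distinguishable:
print(sensitivity_ci(0.70, 40))    # roughly (0.56, 0.84)
print(sensitivity_ci(0.80, 40))    # roughly (0.68, 0.92)

# With many validated cases, the same difference becomes detectable:
print(sensitivity_ci(0.70, 1000))  # roughly (0.67, 0.73)
print(sensitivity_ci(0.80, 1000))  # roughly (0.78, 0.82)
```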
