Thanks @Azza_Shoaibi for kicking off this Phenotype Phebruary 2023 week 1 discussion topic. And thanks @Christian_Reich for chiming in with your perspective. I really hope we’ll hear from others on this important topic.
I really appreciate the careful thought and consideration that @Evan_Minty and @Andrea_Noel put into their peer reviews on pancreatitis and anaphylaxis, respectively, as well as the great discussion that took place on the interactive peer review of appendicitis.
That said, I myself am still undecided about what role (if any) a peer review should play in the phenotype development and evaluation process.
My current thoughts (which I’ll share here provocatively, in the hope of stimulating some debate with others in the community who may have differing opinions):
A phenotype algorithm (aka cohort definition) is a specification for how we identify a cohort, which is a set of persons who satisfy one or more inclusion criteria for a duration of time. For simplicity in this discussion, I’ll focus on the task of phenotyping a disease.
For a given disease (which needs to be precisely defined in a clinical description), the task of the phenotype algorithm is simply to use the data available to identify which persons have the disease and when.
We can try to measure the extent to which a phenotype algorithm accomplishes this temporal classification task in a given database. There are three primary types of measurement error that we want to be on the lookout for (a small sketch of how these might be estimated follows the list):
- sensitivity error - the phenotype algorithm has failed to identify persons who truly have the disease
- specificity error - the phenotype algorithm has failed to exclude persons who truly don’t have the disease (incorrectly classifying a person as having the disease when they do not)
- index date misspecification - the phenotype algorithm incorrectly assigns the start or end date of a person’s disease episode
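To make these three error types concrete, here is a minimal sketch (in Python, not an OHDSI tool) of how they might be quantified against a labeled validation sample. The DataFrame layout and column names (`in_cohort`, `has_disease`, `algorithm_index_date`, `true_onset_date`) are assumptions for illustration only.

```python
# Minimal illustration only -- not OHDSI tooling. Assumes a validation
# DataFrame with one row per person and hypothetical columns:
#   in_cohort (bool)            - the phenotype algorithm classified the person as a case
#   has_disease (bool)          - the reference-standard label
#   algorithm_index_date (date) - cohort start date assigned by the algorithm (optional)
#   true_onset_date (date)      - reference-standard disease onset date (optional)
import pandas as pd

def measurement_error_summary(validation: pd.DataFrame) -> dict:
    tp = (validation.in_cohort & validation.has_disease).sum()
    fp = (validation.in_cohort & ~validation.has_disease).sum()
    fn = (~validation.in_cohort & validation.has_disease).sum()
    tn = (~validation.in_cohort & ~validation.has_disease).sum()

    summary = {
        "sensitivity": tp / (tp + fn),                 # sensitivity error = 1 - sensitivity
        "specificity": tn / (tn + fp),                 # specificity error = 1 - specificity
        "positive_predictive_value": tp / (tp + fp),
    }

    # Index date misspecification, summarized over the true positives only
    if {"algorithm_index_date", "true_onset_date"} <= set(validation.columns):
        cases = validation[validation.in_cohort & validation.has_disease]
        gap_days = (cases.algorithm_index_date - cases.true_onset_date).dt.days
        summary["pct_with_incorrect_index_date"] = (gap_days != 0).mean()
        summary["expected_days_between_real_and_assigned_date"] = gap_days.abs().mean()

    return summary
```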
Given a database estimate of sensitivity/specificity/positive predictive value and a quantification of index date misspecification (e.g., what % of persons have an incorrect date? what is the expected duration between the real and assigned date?), one should be able to evaluate whether the phenotype algorithm is adequate for use in that database by determining that calibrating the evidence statistic of interest for the measurement error (such as: incidence rate in a characterization study; relative risk in a population-level effect estimation analysis; AUC and calibration in a prediction study) would shift it so little that it does not substantially impact the interpretation of the evidence.

More ideally, this would be an objective diagnostic with pre-specified decision thresholds, such that phenotype algorithm adequacy could be determined explicitly from the measurement error estimates. For example, a decision threshold could be something like: accept the phenotype algorithm if ABS(LN(positive predictive value / sensitivity)) < 0.69 (i.e., LN(2)); else reject it. This decision threshold ensures that calibration of the incidence rate of the phenotype as an outcome in a characterization study doesn’t result in more than a 2x change in the observed incidence in either direction.
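To show how such a decision threshold could be applied mechanically (and why 0.69, roughly LN(2), corresponds to the 2x bound), here is a small sketch; the function name and its use as a standalone rule are my own illustration, not an existing OHDSI diagnostic.

```python
import math

def accept_phenotype(ppv: float, sensitivity: float, threshold: float = math.log(2)) -> bool:
    """Illustrative decision rule: the corrected case count is roughly
    observed_count * PPV / sensitivity, so bounding |LN(PPV / sensitivity)|
    by LN(2) ~= 0.69 keeps the calibrated incidence within a 2x change of
    the observed incidence in either direction."""
    return abs(math.log(ppv / sensitivity)) < threshold

# Example: PPV = 0.75, sensitivity = 0.50 -> correction factor of 1.5x -> accept
print(accept_phenotype(ppv=0.75, sensitivity=0.50))  # True

# Example: PPV = 0.90, sensitivity = 0.30 -> correction factor of 3x -> reject
print(accept_phenotype(ppv=0.90, sensitivity=0.30))  # False
```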
If we could agree on objective diagnostics and pre-specified decision thresholds to determine phenotype algorithm adequacy based on the measurement error estimates for a given database (and it’s still a big ‘if’ that requires more methodological research, like what @jweave17 is doing with @Daniel_Prieto at Oxford), then what is the role of peer review?
In my opinion, the job of the peer reviewer is NOT to question the choices within the algorithm, to review the codelist, to speculate on ways the algorithm could be tweaked, or to project whether the algorithm should be used by other researchers in other databases…all of these tasks simply pit one person’s subjective opinions against another’s, with little-to-no empirical basis for reconciling differences. Instead, I believe the role of the peer reviewer COULD BE two-fold: 1) to review the clinical description and ensure it is unambiguous about the intended target of the phenotype algorithm, and 2) to verify that the phenotype algorithm measurement error estimates are appropriate measures aligned with the clinical description. Akin to peer review at a journal, the options available to the peer reviewer would be to either a) accept the phenotype algorithm into a library as a valid algorithm worthy of consideration for re-use by future researchers, b) reject the phenotype algorithm because the measurement error estimates are not valid, or c) request a ‘revise-and-resubmit’ to improve the clinical description or the accompanying evidence about measurement error.
Regardless of whether a peer review takes place, it seems clear to me that the role of the researcher who is conducting a new study and needs to decide whether she will re-use an existing phenotype algorithm is to: 1- determine if the clinical description of the algorithm aligns with her research question, and 2- determine if the measurement error in her data network is sufficiently small to move forward. To accomplish #2, the researcher either needs to 1- re-use the same data that the measurement error estimates were originally generated on, 2- produce new measurement error estimates on her own data, or 3- make a generalization assumption that the error estimates observed by the original developer are likely to be consistent with what would be expected in her new data (without empirically evaluating that assumption). Given this responsibility of the researcher, it’s unclear to me how having a peer review completed ahead of time helps, aside from potentially giving some superficial confidence in considering the phenotype algorithm in the first place.