Thanks @Christian_Reich. The steps you lay out seem reasonable as part of the phenotype development process, and directionally quite aligned with what I tried to introduce on the Phenotype Phebruary kickoff call (video here). Namely, that we should get in the habit of developing our phenotypes with evaluation in mind, with each step aimed at reducing a source of error:
- Identify the persons who might have the disease
  - Aim: Increase sensitivity
  - Task: Create inclusive concept sets used in cohort entry events
- Exclude persons who likely do not have the disease
  - Aim: Increase specificity / positive predictive value
  - Task: Add inclusion criteria
- Determine the start and end dates for each disease episode
  - Aim: Reduce index date misspecification
  - Task: Set the exit strategy; refine entry events and inclusion criteria
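To make the three-step structure concrete, here is a minimal, hypothetical sketch in Python. All names (`CohortDefinition`, the toy concept IDs and person records) are illustrative assumptions, not the actual OHDSI/ATLAS cohort definition model:

```python
# Illustrative sketch of a phenotype algorithm with the three components above:
# an inclusive entry concept set (sensitivity), inclusion criteria
# (specificity/PPV), and an exit strategy (episode persistence).
from dataclasses import dataclass, field


@dataclass
class CohortDefinition:
    entry_concepts: set = field(default_factory=set)   # step 1: entry events
    inclusion_criteria: list = field(default_factory=list)  # step 2: restrict
    persistence_days: int = 0                          # step 3: exit strategy

    def qualifies(self, person):
        """A person enters the cohort if any entry concept is recorded
        and every inclusion criterion holds."""
        has_entry = bool(self.entry_concepts & set(person["concepts"]))
        return has_entry and all(rule(person) for rule in self.inclusion_criteria)


# Toy example with made-up concept IDs and synthetic person records.
cohort = CohortDefinition(
    entry_concepts={201826, 201820},
    inclusion_criteria=[lambda p: p["age"] >= 18],
    persistence_days=365,
)

persons = [
    {"id": 1, "age": 54, "concepts": [201826]},   # qualifies
    {"id": 2, "age": 12, "concepts": [201826]},   # fails age criterion
    {"id": 3, "age": 60, "concepts": [4329847]},  # no entry concept
]
members = [p["id"] for p in persons if cohort.qualifies(p)]
print(members)  # [1]
```

Widening `entry_concepts` corresponds to the sensitivity step; tightening `inclusion_criteria` corresponds to the specificity step; revising both plus `persistence_days` is how index date misspecification gets addressed.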
I would assert that we can and should build better diagnostics for each of these three steps:

1. Diagnostics that help us identify sensitivity errors by finding concepts we haven't yet included which would increase our cohort size without substantially changing the composition of the cohort's characteristics;
2. Diagnostics that help us identify specificity errors by finding concepts which, if used as inclusion criteria requiring either their presence or absence, would decrease our sample size and change the composition of the cohort's characteristics (under the premise that those now excluded are different people from those who remain);
3. Diagnostics that help us understand the distribution of recurrence and the duration between successive entry events, so that one can determine how to differentiate follow-up care from new 'incident' episodes of disease; and
4. Diagnostics that observe the prevalence of symptoms, diagnostic procedures, and treatments in time windows relative to the assigned index date, to determine what revisions to entry events may be warranted to reduce index date misspecification.
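As a sketch of what the last of these diagnostics might compute, here is a small, hypothetical Python function that measures the prevalence of a concept in time windows relative to each person's assigned index date. The function name, record layout, and concept IDs are all assumptions for illustration; in practice this would be a query against OMOP CDM tables:

```python
# Illustrative diagnostic: prevalence of a related concept in windows around
# the index date. A high pre-index prevalence of diagnostic workup may suggest
# the assigned index date falls later than true disease onset.
def window_prevalence(records, index_dates, concept_id, windows):
    """records: list of (person_id, concept_id, day) events (day as integer).
    index_dates: dict mapping person_id -> index day.
    windows: dict mapping label -> (start_offset, end_offset) around index.
    Returns: dict mapping label -> proportion of cohort with the concept
    observed in that window."""
    hits = {label: set() for label in windows}
    for person_id, cid, day in records:
        if cid != concept_id or person_id not in index_dates:
            continue
        offset = day - index_dates[person_id]
        for label, (lo, hi) in windows.items():
            if lo <= offset <= hi:
                hits[label].add(person_id)
    n = len(index_dates)
    return {label: len(people) / n for label, people in hits.items()}


# Toy data: three cohort members, one made-up measurement concept.
index_dates = {1: 100, 2: 200, 3: 300}
records = [
    (1, 4145666, 70),    # 30 days before index
    (2, 4145666, 205),   # 5 days after index
    (3, 4145666, 295),   # 5 days before index
]
windows = {"pre-index": (-60, -1), "post-index": (0, 60)}
prev = window_prevalence(records, index_dates, 4145666, windows)
print(prev)
```

Comparing the pre-index and post-index proportions across symptoms, procedures, and treatments is the kind of evidence that would motivate revising entry events.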
At the end of that development and evaluation cycle, if you've iterated to increase sensitivity, then increase specificity, then decrease index date misspecification, on one or more databases, you should be left with a final phenotype algorithm that warrants consideration from the community. In a preferred state, the sensitivity, specificity, positive predictive value, and index date misspecification would be explicitly quantified; at a minimum, they can be discussed in an evaluation report summarizing the algorithm.
For me, I’m just unsettled on what the incremental value of the peer review process is. If it’s simply to verify that some process has been followed, then we’re just talking about a checklist to ensure the submission has the required elements (without subjective judgement about its contents). But if we imagine peer reviewers have to execute the development/evaluation steps themselves to see if they reach the same conclusion, then I think that’s an unnecessary burden on the reviewer and quite unlikely to yield a successful replication. I think of it as akin to peer review in publications: the reviewer is supposed to assess the scientific validity of the submission, confirming that the methods are sufficiently described and appropriate to generate the results presented, and that the interpretation of the results is put in sufficient context with what is known and what can be inferred from the data. But reviewers are generally not responsible for independently reproducing the results, nor should they try to change the original aim of the study (though they can opine to the journal editors on whether they find the topic relevant to the field). It seems to me that, in our world of phenotyping, it’s ALWAYS appropriate to aim to develop a phenotype algorithm for a particular clinical target (so we don’t need to question intent or relevancy), so what does that leave for the peer reviewer to do?