I will chime in on this, as I've been a big fan of the complementary nature of:
- Profile review (through CohortExplorer or ATLAS - our equivalents to chart review, pending KEEPER - although narrative chart review can also have a role)
- Population level diagnostics (i.e., CohortDiagnostics)
- PheValuator / statistically learned models.
Event sequences tell the patient story.
Currently, it's difficult to infer sequences from our population-level results in CohortDiagnostics (beyond the typical …[-30,-1], index, [1,30]… windows), and it can be difficult to reconstruct population-level strata 'stories' from FeatureExtraction output in CohortDiagnostics. Neither of these diminishes the big advance that 'population-level validation' represents for our field!
We do have tools / adjuncts that could conceivably do this. Time windows can be adjusted (although things get messy at a population level if you try to get too granular), cohorts-as-features in CohortDiagnostics can sum across concepts, and treatment pathways can lay out population-level sequences. Configuring them upfront, however, would represent a large front-loading of effort in phenotype development.
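To make the time-window point concrete, here's a minimal pandas sketch of what 'adjusting the windows' looks like in principle: bucket events by day relative to index into custom windows, then pivot to a person-by-(concept, window) count table. The layout and column names (person_id, concept_name, days_from_index) and the toy data are assumptions for illustration, not the actual FeatureExtraction / CohortDiagnostics schema.

```python
import pandas as pd

# Hypothetical long-format event table: one row per (person, concept, day relative to index).
events = pd.DataFrame({
    "person_id":       [1, 1, 1, 2, 2],
    "concept_name":    ["CKD stage 3", "metformin", "eGFR measured", "CKD stage 3", "dialysis"],
    "days_from_index": [-200, -15, -3, -45, 10],
})

# Custom windows (days relative to index), finer than the usual [-30,-1] / index / [1,30] trio.
windows = [(-365, -31), (-30, -1), (0, 0), (1, 30), (31, 365)]

def assign_window(day):
    """Return a label for the first window containing `day`, or None if out of range."""
    for lo, hi in windows:
        if lo <= day <= hi:
            return f"[{lo},{hi}]"
    return None

events["window"] = events["days_from_index"].map(assign_window)

# Person x (concept, window) counts: a crude population-level view of sequence by window.
features = (
    events.dropna(subset=["window"])
          .groupby(["person_id", "concept_name", "window"])
          .size()
          .unstack(["concept_name", "window"], fill_value=0)
)
print(features)
```

The messiness noted above shows up quickly here: the finer the windows, the sparser (and less interpretable) each population-level cell becomes.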
Contra @Patrick_Ryan's provocation around the value of timeline review in ascertaining PPV: in reviewing timelines / cases, the sequences of most interest are the misses, not the successes or their frequency.
Discerning the 'failure modes' of the misses often benefits from the information about event sequences that we are otherwise losing, or that we would otherwise have struggled to anticipate a priori. Once identified, however, questions about the prevalence of possible 'failure modes' can often be answered with CohortDiagnostics output.
Synergy, Kumbaya, and all that!
But I’ll also be provocative.
Is it possible that our current processes and tools have been conditioned by the fact that we don't routinely adjudicate individual profiles as part of our process? In adjudicating profiles, for example, one thing I've found is that, often, the sum of features relating to a competing diagnosis can be the 'feature' that outweighs the diagnosis of interest in ultimately determining a 'case'. We don't have tools to evaluate that easily (hat tip to Phea, and, coming soon, KEEPER). But is that because we haven't been exposed to those failure modes enough?
Given the interest in 'failure modes', I'd agree that 'in-sample' validation is in some ways the least interesting. I've still found it useful. It has revealed issues that I don't think would easily come to the foreground otherwise. I've made mistakes in cohort design logic that were immediately obvious on looking at <5 trajectories.
In sum, I think profile validation has a role in a few areas that tie together our tools:
- In-sample validation. 15-20 cases. The main interest is whether there's a pattern when it fails; the prevalence of those failures can often be assessed with CohortDiagnostics output.
- xSpec validation. 15 cases. Our prior for misses is very low, but it's worth examining simply given how much of our subsequent decision making is conditioned on xSpec output. Almost all will be true cases, so it should be fast! It might save you angst down the road! To @katy-sadowski's point, I'd agree (having been involved in Aphrodite generation of the Charlson comorbidities) that some things are simply hard to phenotype, statistical model or not, and some features aren't what we wish they were.
- (Proposed, as I have limited experience in this set): Areas of disagreement between PheValuator and the rule-based cohort definition (false positives, false negatives). These can be selected from samples that are close to the decision boundaries; see the sketch after this list. In an information-theoretic sense, this is where we can probably learn the most. Maybe 15 of each. These may well inform future iterations of the cohort definition. (I also really like @Christophe_Lambert's approach to calibration - I look forward to reading it when it's out in print.)
- There are situations where we reach a boundary of what we can determine from the observational record. A good example is trying to tell the difference between a single encounter for a mild case of X and an encounter for a 'rule-out' of X. Those may even be reasonably prevalent. In sites amenable to it, a targeted narrative chart review (or event result report review) might yield some insights; to the extent that they relate to underlying biology, they may even generalize, more or less. I'd be wholly excited by any forthcoming finding that showed such insights aren't important enough to change an effect estimate (@jweave17). How excited? One side of my chest is not yet spoken for!
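As a rough illustration of the decision-boundary sampling proposed above, here's a minimal pandas sketch, assuming you have a per-person table with a rule-based cohort flag and a PheValuator-style predicted probability of being a true case. The column names, the toy data, and the 0.5 cut point are all assumptions for illustration, not PheValuator's actual output format.

```python
import pandas as pd

# Hypothetical per-person table: rule-based cohort membership plus a model-estimated
# probability of being a true case (e.g., from a probabilistic reference standard).
df = pd.DataFrame({
    "person_id": range(1, 1001),
    "in_cohort": [i % 3 == 0 for i in range(1, 1001)],               # rule-based definition
    "p_case":    [((i * 37) % 100) / 100 for i in range(1, 1001)],   # toy model probabilities
})

threshold = 0.5  # assumed cut point for the model's case / non-case call

df["model_case"] = df["p_case"] >= threshold
df["dist_to_boundary"] = (df["p_case"] - threshold).abs()

# Disagreements in both directions.
likely_false_pos = df[df["in_cohort"] & ~df["model_case"]]   # definition says case, model says no
likely_false_neg = df[~df["in_cohort"] & df["model_case"]]   # model says case, definition says no

# Take ~15 of each, closest to the decision boundary, as the profile-review sample.
review_set = pd.concat([
    likely_false_pos.nsmallest(15, "dist_to_boundary"),
    likely_false_neg.nsmallest(15, "dist_to_boundary"),
])
print(review_set[["person_id", "in_cohort", "p_case"]])
```

Reviewing the profiles in `review_set` is where I'd expect the failure modes (and the next iteration of the definition) to come from.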