Building a validated OHDSI outcome library

Thank you @schuemie, this is an excellent post to get our community to
rally around formalizing our objective for cohort definition and
evaluation. I hope others can contribute here so we can develop an explicit
plan of attack to research and develop strategies that can ultimately
become a new community best practice.

Just to add a little bit to the scaffolding before jumping too far into
@nigam’s and @jswerdel’s nice candidate approaches:

In OHDSI, we have defined a cohort to be a set of persons who satisfy one
or more inclusion criteria for a duration of time.

This cohort definition has several important implications.

  1. A cohort is based on timestamped clinical observations in a database.
    These clinical observations are merely proxies for the true health state of
    a patient at any moment in time, and some of those proxies are more or less
    reliable than others. Ex: A pharmacy dispensing record is a proxy in
    claims data for a patient’s drug exposure, but we don’t actually have true
    evidence of when or how much the patient actually consumed, or how the drug
    was metabolized in the body. It is often tempting to get lazy in
    describing a cohort by the phenomenon we seek to find vs. the proxies we
    use to infer the phenomenon (ex: ‘persons with diabetes’ vs. ‘persons with
    a health encounter where a diagnosis code of diabetes was recorded’), but
    it’s exactly this distinction that results in cohort misclassification
    error. There may be people who have the phenomenon of interest who don’t
    have the proxies (false negatives) and people who don’t have the phenomenon
    of interest but do have the proxies (false positives).

  2. A cohort definition is temporal in nature. So, it would be incomplete
    to say a cohort is ‘persons with diabetes’, because that description does
    not convey the notion of when the persons enter and exit the cohort. A
    more complete definition could be: ‘persons with diabetes, who enter at
    the time of their first diagnosis of the disease and exit at the end of
    their observation period’, or ‘persons who have continuous enrollment from
    Jan2017 to Dec2017, who have a diagnosis of diabetes during that time’.
    Note, just because the cohort is temporal, doesn’t mean the use of the
    cohort requires the temporality. Some expected cases: 1) often when we
    define an ‘outcome cohort’, we care about cohort entry, but are less
    concerned with cohort exit (for example, when we’re doing a
    population-level effect estimation study that examines a time-to-event
    relationship). 2) if you were doing genome association studies, we
    (currently) believe your genetics are constant and determined at birth, so
    relating a SNP to a disease has an implicit temporal relationship (e.g.
    gene preceded disease) that is not required to be explicitly modeled.
    However, when we are in a situation where the temporality information
    (either cohort entry or cohort exit) is analytically important, there is
    another potential source of error: misspecification of the cohort
    entry/exit date.

For purposes of this rant, I’m going to table the date misspecification
problem to focus on the misclassification error, but I don’t want us to
lose sight of that other source of error indefinitely.

The extent of misclassification error of any given cohort can be fully
characterized by 3 metrics: sensitivity (the proportion of true cases
correctly identified as a candidate case by a cohort definition),
specificity (the proportion of false cases correctly identified as not a
candidate case by a cohort definition), and positive predictive value (the
proportion of cases identified as a candidate case by a cohort definition
which are correctly true cases). With these 3 metrics in place, the entire
confusion matrix can be computed to quantify the number of ‘true
positives’, ‘false positives’, ‘false negatives’ and ‘true negatives’ in a
population of known size, and also to yield other associated metrics, such
as: prevalence, accuracy, and negative predictive value. Each of these three
metrics provides complementary information about the extent of
misclassification, but each on its own is insufficient. @rosa.gini had a
nice poster that showed the relationship between the various metrics at
last year’s OHDSI Symposium.
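
To make that arithmetic concrete, here is a minimal Python sketch of how the three metrics pin down the rest of the confusion matrix. The function name and the example numbers are hypothetical, chosen only for illustration. The population size is an extra input, because the three metrics alone fix proportions rather than counts. The sketch assumes specificity and PPV are strictly between 0 and 1, since prevalence cannot be recovered at those extremes.

```python
def confusion_matrix_from_metrics(sensitivity, specificity, ppv, population_size):
    """Rebuild confusion-matrix counts from sensitivity, specificity and PPV.

    Hypothetical helper for illustration only.
    """
    if not (0 < sensitivity <= 1 and 0 < specificity < 1 and 0 < ppv < 1):
        raise ValueError("need 0 < sensitivity <= 1 and 0 < specificity, ppv < 1")
    if population_size <= 0:
        raise ValueError("population_size must be positive")

    # Rearranging PPV = se*p / (se*p + (1 - sp)*(1 - p)) gives the prevalence odds.
    odds = ppv * (1 - specificity) / (sensitivity * (1 - ppv))
    prevalence = odds / (1 + odds)

    true_cases = prevalence * population_size
    non_cases = population_size - true_cases
    tp = sensitivity * true_cases   # true cases the definition catches
    fn = true_cases - tp            # true cases the definition misses
    tn = specificity * non_cases    # non-cases correctly left out
    fp = non_cases - tn             # non-cases wrongly included

    return {
        "prevalence": prevalence,
        "TP": tp, "FP": fp, "FN": fn, "TN": tn,
        "accuracy": (tp + tn) / population_size,
        "npv": tn / (tn + fn),
    }


# Made-up inputs: sensitivity 0.80, specificity 860/900, PPV 80/120, 1,000 persons.
example = confusion_matrix_from_metrics(0.80, 860 / 900, 80 / 120, 1000)
```

With these made-up inputs the implied prevalence is 10%, which corresponds to 80 true positives, 40 false positives, 20 false negatives and 860 true negatives.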

@jswerdel did a nice job of presenting on the OHDSI community call a few
weeks back about the current challenges seen in ‘validation’ across the
broader research community to date (
https://drive.google.com/file/d/1VuzO4nPlxHiOSBa-4LVOcdk8_CKMQlV0/view?usp=sharing).
He highlighted the work of Rubbo et al. (
https://www.ncbi.nlm.nih.gov/pubmed/25966015), who performed a systematic
review of validation efforts of the outcome of myocardial infarction;
amongst the 33 ‘validation’ studies identified, all reported positive
predictive value, but only 11 reported sensitivity and 5 reported
specificity. I think these results are generally representative of
validation work at large, where the prevailing approach to ‘validation’ for
most cohort definitions focuses on some sampling of candidate cases with
clinical adjudication through source record verification, which can provide
an estimate of positive predictive value. But positive predictive value
alone, without accompanying estimates of sensitivity and specificity, is
not directly actionable, because it doesn’t speak to how many cases are
‘missed’ (i.e. false negatives). Bush et al. (
https://www.ncbi.nlm.nih.gov/pubmed/29405474) recently showed how
alternative cohort definitions can reflect tradeoffs between PPV and
sensitivity, further underscoring the need to estimate both.
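
A small worked example shows why PPV alone leaves the number of missed cases undetermined. The sketch below is illustrative only: the function name and all of the numbers are hypothetical and are not taken from Rubbo et al. or Bush et al.

```python
def missed_cases(flagged_count, ppv, sensitivity):
    """Number of true cases that a cohort definition fails to capture."""
    true_positives = flagged_count * ppv
    all_true_cases = true_positives / sensitivity
    return all_true_cases - true_positives


# Two hypothetical definitions, both flagging 1,000 persons with PPV = 0.90.
missed_if_sensitive = missed_cases(1000, ppv=0.90, sensitivity=0.90)
missed_if_insensitive = missed_cases(1000, ppv=0.90, sensitivity=0.30)
```

Both definitions would look identical in a PPV-only validation study, yet the first misses roughly 100 true cases and the second roughly 2,100.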

As an analogy from the machine learning community, and as currently applied
in the best practices of our patient-level prediction workgroup, one commonly
used measure of discriminative accuracy is the Area Under the Receiver
Operating Characteristic curve (AUROC), which can be described as the
integration across the potential tradeoffs between
sensitivity and specificity at all possible threshold values. Depending on
your perspective, AUROC has the advantage or disadvantage of being agnostic
to the prevalence of the target condition. Another metric for evaluating
discrimination in prediction work is Area Under Precision-Recall Curve
(AUPRC), which can be described as the integration across the potential
tradeoffs between sensitivity (recall) and positive predictive value
(precision) at all possible threshold values. Depending on your
perspective, the AUPRC has the advantage or disadvantage of being
conditionally defined by the prevalence of the target condition. In
prediction, neither AUROC nor AUPRC is ‘right’ or ‘wrong’, but rather they
provide complementary information, because together they represent all 3
necessary metrics: sensitivity, specificity, and positive predictive value.
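
The prevalence point can be checked with a toy example. The sketch below is plain Python with made-up scores. It computes AUROC as the fraction of (case, non-case) pairs that are ranked correctly, and it uses average precision, one common estimator of the area under the precision-recall curve, as a stand-in for AUPRC. Prevalence is lowered by repeating the non-case scores ten times, so the score distributions themselves do not change.

```python
def auroc(pos_scores, neg_scores):
    """Probability that a random case outscores a random non-case (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def average_precision(pos_scores, neg_scores):
    """Mean of the precision values observed at the rank of each true case."""
    ranked = sorted(
        [(s, 1) for s in pos_scores] + [(s, 0) for s in neg_scores],
        key=lambda pair: pair[0],
        reverse=True,
    )
    hits = 0
    precisions = []
    for rank, (_, is_case) in enumerate(ranked, start=1):
        if is_case:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)


# Toy scores with no ties between cases and non-cases (all values are made up).
cases = [0.9, 0.8, 0.6, 0.4]
non_cases = [0.7, 0.5, 0.3, 0.2, 0.1]
non_cases_x10 = non_cases * 10  # same score distribution, lower prevalence

auroc_common = auroc(cases, non_cases)
auroc_rare = auroc(cases, non_cases_x10)
ap_common = average_precision(cases, non_cases)
ap_rare = average_precision(cases, non_cases_x10)
```

Here AUROC is 0.85 in both settings, while average precision drops from about 0.85 to about 0.60 once non-cases are ten times more common.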

So, the first assertion I would like to make is that we need a cohort
evaluation process that can provide estimates of sensitivity, specificity,
and positive predictive value. So, source record verification of a sample
of candidate cases would not be a sufficient process on its own, because it
only provides PPV. But it does represent one possible approach to produce
one of the estimates we need. There may be other approaches that can also
estimate positive predictive value, and it is worthwhile for us to develop
and evaluate those alternative approaches because we know source record
verification is very resource-intensive, not fully reproducible, and
increasingly problematic as the EHR data becomes the only ‘source’ to
verify against. @jon_duke and his team shared their work at last year’s
symposium in allowing electronic record verification via patient profile
annotations, and that seems like a promising direction for reducing the
resource burden of clinical adjudication, so I’m eager to see that work
completed and rolled into the OHDSI ecosystem. @jswerdel’s approach
represents an orthogonal approach to empirically estimate positive
predictive value without clinical adjudication.

But beyond building a better mouse trap for the part of the problem for
which we have an approach (PPV estimation), we need to also devote time and
attention to researching and developing methods that estimate sensitivity
and specificity. This is why I’m so excited by the work that @nigam and
@jswerdel are proposing, because it’s almost always infeasible to do source
record verification on a sufficient population sample to yield reliable
sensitivity/specificity estimates. @schuemie provides some valid concerns
about both approaches, which I share, but I think it’s important to frame
the context here: the first-order problem we need to solve is a shared
recognition that sensitivity/specificity/PPV need to be estimated in order
to quantify the misclassification error. The second-order problem we’re
now talking about is whether there may be systematic error in the method
that we use to estimate the sensitivity/specificity/PPV, meaning we have an
estimate but we don’t know if it’s biased, and we may need a broader perspective
to reflect our inherent uncertainty that goes beyond the sampling
variability in most current confidence interval calculations.
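
For reference, this is the kind of interval that reflects sampling variability only. The sketch below uses the Wilson score interval for a proportion, which is my choice for illustration and not necessarily what any of the cited validation studies used. The adjudication counts are hypothetical.

```python
import math


def wilson_interval(confirmed, sampled, z=1.959964):
    """Wilson score interval for a proportion (default z gives about 95% coverage)."""
    if sampled <= 0 or not 0 <= confirmed <= sampled:
        raise ValueError("need sampled > 0 and 0 <= confirmed <= sampled")
    p_hat = confirmed / sampled
    denom = 1 + z**2 / sampled
    centre = (p_hat + z**2 / (2 * sampled)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / sampled + z**2 / (4 * sampled**2)) / denom
    )
    return centre - half_width, centre + half_width


# Hypothetical adjudication: 90 of 100 sampled candidate cases were confirmed.
ppv_low, ppv_high = wilson_interval(90, 100)
```

The interval narrows as more charts are reviewed, but it says nothing about whether the adjudication process, or the way candidate cases were sampled, is itself biased.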

To that end, one reflection that may or may not be useful: we have three
different metrics that we are trying to estimate: sensitivity, specificity,
and positive predictive value. Clearly, the ‘true’ values of these metrics
are highly dependent on one another, but I’m not convinced that this means
that whatever approaches we develop to estimate each of the three metrics
have to account for this dependence. Instead, we could frame three
independent research opportunities:

  1. Develop a process to estimate the proportion of true cases that a given
    cohort definition identifies (sensitivity), and evaluate its operating
    characteristics (e.g. accuracy and precision of the sensitivity estimate)
    and efficiency (e.g. resource requirements, assumptions/constraints).

  2. Develop a process to estimate the proportion of false cases that a given
    cohort definition correctly fails to identify (specificity), and evaluate
    its operating characteristics (e.g. accuracy and precision of the
    specificity estimate) and efficiency (e.g. resource requirements,
    assumptions/constraints).

  3. Develop a process to estimate the proportion of cases identified as a
    candidate case by a cohort definition which are correctly true cases
    (positive predictive value), and evaluate its operating characteristics
    (e.g. accuracy and precision of the PPV estimate) and efficiency (e.g.
    resource requirements, assumptions/constraints).

It could very well be that our ‘best practice’ might ultimately involve one
or more different processes for each metric. I think it will be a
tremendous contribution to the field at large when we build a cohort
definition library, where each human-readable and computer-executable
definition is accompanied by the characterization statistics and cohort
evaluation metrics against the OHDSI network of databases. But I expect
it’ll prove even more valuable when we also provide the field a clear
cookbook for how to follow ‘best practice’ to define and evaluate the next
cohort that can then be added into the library. Give a man a fish and you
feed him for a day; teach a man to fish and you feed him for a lifetime.