This is a great thread, very happy to see this discussion taking place.
I’ll add my 2 cents for posterity’s sake:
How do I define ‘phenotype’? In the ‘Phenotype’/‘Cohort definition’ tutorial that I’ve offered a few times recently, I’ve used the description by @hripcsa and Dave Albers in their 2017 JAMIA paper “High-fidelity phenotyping: richness and freedom from bias”: “A phenotype is a specification of an observable, potentially changing state of an organism, as distinguished from the genotype, which is derived from an organism’s genetic makeup. The term phenotype can be applied to patient characteristics inferred from electronic health record (EHR) data. Researchers have been carrying out EHR phenotyping since the beginning of informatics, from both structured data and narrative data. The goal is to draw conclusions about a target concept based on raw EHR data, claims data, or other clinically relevant data. Phenotype algorithms – ie, algorithms that identify or characterize phenotypes – may be generated by domain experts and knowledge engineers, including recent research in knowledge engineering or through diverse forms of machine learning…to generate novel representations of the data.”
I like this introduction for a few reasons: 1) it makes it clear that we are talking about something that’s observable in our observational data, 2) it includes the notion of time in the phenotype specification (since a state of a person can change), 3) it draws a distinction between the phenotype as the desired intent vs. the phenotype algorithm, which is the implementation of the desired intent.
In our tutorials, after I introduce the idea of ‘phenotype’ and ‘phenotype algorithms’, I introduce a new term, ‘cohort’, and here, we have a very explicit definition:
cohort = a set of persons who satisfy one or more inclusion criteria for a duration of time.
From there on in the tutorial, I try to be very precise in using this term, and reinforce this definition. We highlight how to create ‘cohort definitions’ as specifications for the criteria that persons must satisfy over time, we introduce how to design ‘cohort definitions’ using the OHDSI tool ATLAS, and we demonstrate how ‘cohort definitions’ can be executed against OMOP CDM-compliant databases to identify the records which can populate the CDM’s COHORT table (which is defined by PERSON_ID, COHORT_DEFINITION_ID, COHORT_START_DATE, and COHORT_END_DATE).
We highlight the consequences of subscribing to this definition of ‘cohort’: 1) one person may belong to multiple cohorts, 2) one person may belong to the same cohort at multiple different time periods, 3) one person may not belong to the same cohort multiple times during the same period of time, 4) one cohort may have zero or more members, 5) a codeset is NOT a cohort, because logic for how to use the codeset in inclusion criteria is required.
And most importantly, we demonstrate how adoption of this definition of ‘cohort’ can enable the successful design and implementation of standardized analytics which rely on ‘cohorts’ as foundational inputs to support clinical characterization, population-level effect estimation, and patient-level prediction. It’s important to note that, under this definition of ‘cohort’, a cohort can represent a disease phenotype (e.g. persons developing Type 2 diabetes), a drug exposure phenotype (e.g. persons initiating metformin for their Type 2 diabetes), a measurement phenotype (e.g. persons with hemoglobin A1c > 6.5%), or more generally, any combination of any criteria across any of the data domains that are observable (which basically means any tables in the OMOP CDM).
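To make the consequences of this definition concrete, here is a minimal sketch in Python of cohort rows and a check for consequence 3 (no overlapping membership in the same cohort). The field names follow the four columns named in this post; published CDM v5 specifications call the person column SUBJECT_ID, so adjust to your CDM version. The class `CohortRow`, the function `overlapping_rows`, the sample data, and the choice to treat both dates as inclusive are all assumptions made for illustration, not part of ATLAS or the CDM.

```python
from dataclasses import dataclass
from datetime import date
from itertools import combinations


@dataclass(frozen=True)
class CohortRow:
    """One cohort membership span (hypothetical helper, not an OHDSI class)."""
    person_id: int
    cohort_definition_id: int
    cohort_start_date: date
    cohort_end_date: date


def overlapping_rows(rows):
    """Return pairs of rows where one person is in the same cohort twice at the same time."""
    violations = []
    for a, b in combinations(rows, 2):
        same_key = (a.person_id, a.cohort_definition_id) == (b.person_id, b.cohort_definition_id)
        # Assumption: both dates are inclusive, so two spans overlap when
        # each one starts on or before the other one ends.
        overlap = (a.cohort_start_date <= b.cohort_end_date
                   and b.cohort_start_date <= a.cohort_end_date)
        if same_key and overlap:
            violations.append((a, b))
    return violations


rows = [
    CohortRow(1, 100, date(2015, 1, 1), date(2015, 6, 30)),
    CohortRow(1, 100, date(2017, 3, 1), date(2017, 9, 30)),  # same cohort, later period: allowed
    CohortRow(1, 200, date(2015, 2, 1), date(2016, 1, 31)),  # a different cohort: allowed
    CohortRow(2, 100, date(2016, 1, 1), date(2016, 12, 31)),
]
empty_cohort = []  # a cohort with zero members is still a valid cohort

# Adding a span that overlaps person 1's first span in cohort 100 breaks consequence 3.
bad_rows = rows + [CohortRow(1, 100, date(2015, 6, 1), date(2015, 8, 1))]

print(len(overlapping_rows(rows)))      # no violations in the valid rows
print(len(overlapping_rows(bad_rows)))  # one violating pair
```

The pairwise comparison is quadratic in the number of rows, which is fine for a teaching example; against a real database you would do this check in SQL or sort spans per person first.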
Now, in theory, if all data about a person were completely and accurately captured and ‘observable’, then it should be possible to take a ‘phenotype’ (the specification of the observable, changing state of the organism) and apply a ‘phenotype algorithm’ to the data to determine the spans of time for which that person satisfied the inclusion criteria to belong to the phenotype cohort. That is, with perfect data, there would be no difference between the desired intent and the materialization of the desired intent.
In practice, because data are imperfect, the ‘phenotype algorithm’ represents a proxy (whether it be a rule-based heuristic or probabilistic model) that attempts to represent the ‘phenotype’ given the available data. The phenotype cohort (the persons that satisfy the ‘phenotype algorithm’ criteria for durations of time) is an instantiation of that proxy. The differences between the true phenotype (which people actually belong to the observable health state of interest?) and the phenotype cohort (which people were identified as satisfying a set of criteria for a duration of time?) represent measurement error. There are multiple dimensions of error: 1) a person who truly belonged in the phenotype was not identified by the phenotype algorithm (false negative), 2) a person who did not belong to the phenotype was incorrectly identified by the phenotype algorithm (false positive), 3) the time at which a person entered a cohort may be misclassified (i.e. the cohort start date may not reflect the person’s true moment of entering the health state), and 4) the time at which a person exited a cohort may be misclassified (i.e. the cohort end date may not reflect the person’s true moment at which they no longer satisfied the criteria to belong to that health state).
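The four dimensions of error above can be illustrated with a small Python sketch. It assumes something we almost never have in practice: a reference set of true spans per person. The dictionaries `truth` and `cohort`, and the simplification of one span per person, are hypothetical and only serve to show what each error type looks like.

```python
from datetime import date

# Hypothetical reference spans: person_id -> (true start, true end) of the health state.
truth = {
    1: (date(2015, 1, 10), date(2015, 12, 31)),
    2: (date(2016, 3, 1), date(2016, 9, 30)),
    3: (date(2017, 5, 5), date(2018, 5, 4)),
}
# Hypothetical phenotype cohort: person_id -> (cohort_start_date, cohort_end_date).
cohort = {
    1: (date(2015, 2, 1), date(2015, 12, 31)),  # found, but entry is recorded late
    3: (date(2017, 5, 5), date(2018, 3, 5)),    # found, but exit is recorded early
    4: (date(2016, 1, 1), date(2016, 6, 30)),   # not in the reference set
}

# 1) in the phenotype but missed by the algorithm
false_negatives = sorted(set(truth) - set(cohort))
# 2) picked up by the algorithm but not in the phenotype
false_positives = sorted(set(cohort) - set(truth))

# 3) and 4): for persons in both sets, signed error in days
# (positive means the cohort date is later than the true date).
matched = sorted(set(truth) & set(cohort))
start_error_days = {p: (cohort[p][0] - truth[p][0]).days for p in matched}
end_error_days = {p: (cohort[p][1] - truth[p][1]).days for p in matched}

print(false_negatives, false_positives)
print(start_error_days, end_error_days)
```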
For all of us in observational research, we need to accept that measurement error is a fact of life. All retrospective analyses of existing observational data must deal with how measurement error may influence the evidence being generated from the data. In a clinical characterization of disease natural history, measurement error may mean that prevalence or incidence of a condition is under- or over-reported. Measurement error in the cohort start date can mean misrepresentation of the time-to-event relationship between an exposure and outcome. In population-level effect estimation, misclassification in the target or comparator cohorts, or in the outcome of interest can bias our relative risks, and measurement error in baseline covariates can result in inadequate adjustment inducing confounding that can further bias the relative risk estimates. In patient-level prediction, measurement error in either the target or outcome cohorts can challenge the generalizability of the model from the proxies it was trained on to the ‘stated intent’ of the phenotypes of interest. If we had a proper understanding of the measurement error, we could incorporate it into our analyses to generate more reliable evidence that accounts for this added layer of uncertainty.
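As one illustration of how known measurement error could be folded into an analysis, the sketch below shows how sensitivity and specificity distort an observed prevalence, and how a standard back-correction (often called the Rogan-Gladen estimator) recovers the underlying value. This is a textbook adjustment that I am bringing in as an example, not something proposed earlier in this thread, and the function names and numbers are made up.

```python
def observed_prevalence(true_prevalence, sensitivity, specificity):
    """Prevalence you would measure with an imperfect phenotype algorithm."""
    true_positives = sensitivity * true_prevalence
    false_positives = (1 - specificity) * (1 - true_prevalence)
    return true_positives + false_positives


def corrected_prevalence(observed, sensitivity, specificity):
    """Back out the underlying prevalence from the observed one (Rogan-Gladen form)."""
    denominator = sensitivity + specificity - 1
    if denominator <= 0:
        # The algorithm is no better than chance, so the correction is undefined.
        raise ValueError("sensitivity + specificity must exceed 1")
    return (observed + specificity - 1) / denominator


# Made-up numbers: 10% true prevalence, sensitivity 0.80, specificity 0.95.
observed = observed_prevalence(0.10, sensitivity=0.80, specificity=0.95)
recovered = corrected_prevalence(observed, sensitivity=0.80, specificity=0.95)
print(round(observed, 3), round(recovered, 3))
```

With these inputs the observed prevalence is over-reported (about 12.5% against a true 10%) even though sensitivity is well below 1, because false positives from the much larger unaffected group outweigh the missed cases. The correction is only as good as the sensitivity and specificity estimates fed into it.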
So, with all that said, I generally support the outline that @schuemie started, which others have added onto, and I like @apotvien’s framing that he has introduced in the Phenotype workgroup, so I’ll restate using the language framed above.
For each ‘phenotype’, we need to have a clear description of the stated intent (what is the observable, changing state of the organism that we are trying to represent?). I believe the stated intent likely has multiple components: 1) a clinical definition - not just a label, like a disease name (e.g. ‘Type 2 diabetes’), but a complete specification of what the entity means, how it manifests and would become observable (e.g. is T2DM identified because a clinician believes a person has Type 2 diabetes and records a diagnosis code in their EHR, or is it only confirmed on the basis of HbA1c > 6.5%?), 2) a logical description - how will the clinical definition be applied to observational data? described in some human-readable form, with text and/or graphical depiction of logic, 3) an intended use - how will the phenotype be applied to generate evidence? is it intended to represent an exposure of interest or an outcome or some baseline characteristic, to serve as input into a characterization, estimation or prediction study? is the phenotype intended to be applied in one specific context/database, or desired to be re-usable and transportable across a data network?
Each ‘phenotype’ could have one or more ‘phenotype algorithms’, each of which can be expressed as computer-executable code that, when implemented against a CDM-compliant database, instantiates a cohort representing the set of persons satisfying inclusion criteria for a duration of time. Ideally, the computer-executable code will be consistent with the human-readable logical description above.
For each observational database that a ‘phenotype algorithm’ is applied to, there is an opportunity to characterize the resulting cohort, and evaluate the performance of the ‘phenotype algorithm’ in representing the ‘phenotype’.
For cohort characterization, it would seem desirable to summarize the incidence and prevalence of the cohort within the database population, as well as detail baseline characteristics of the cohort to get some descriptive sense of the patient composition. It’d be nice to have a simple standardized analytic that would produce a ‘phenotype characterization’: a shareable set of aggregate summary statistics (no patient-level data) that could be uploaded into the phenotype library, under the phenotype entry and labeled with the source data that it was contributed from.
For evaluation, it seems our real objective is to quantify the extent of measurement error. If we are talking about the misclassification of ‘false positives’ and ‘false negatives’, then the evaluation metrics that make most sense to summarize error are those directly computable from a confusion matrix: at a minimum, I would hope that we would aspire to estimate 1) sensitivity, 2) specificity, and 3) positive predictive value. Here, the trick is there is no one consensus ‘right’ way to estimate measurement error. Several approaches exist, including those discussed in this thread: ‘chart adjudication’ and PheValuator. So, for our phenotype library, when a phenotype algorithm is applied to a particular data source, then we’d like to capture not just the estimates of measurement error (as represented by the operating characteristics sensitivity/specificity/positive predictive value) but also a description of the method used to estimate the measurement error. For chart adjudication, we’d like to know what sample of charts were adjudicated, which charts and how many, who were the adjudicators, what information was used for adjudication, etc. For PheValuator, we’d like to know what inputs were used in the process, including specification of the noisy positive and noisy negative labels used to train the probabilistic gold standard. The other form of measurement error is misclassification of the timing of cohort entry/exit, and similarly there, since there is no one agreed practice for evaluating this error, any estimate should be accompanied by a description of how the estimation was made.
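For reference, the three operating characteristics named above follow directly from the four cells of a confusion matrix. The Python sketch below computes them and keeps a free-text note about how the counts were obtained alongside the numbers, in the spirit of recording the estimation method with each estimate. The function name, the dictionary layout, and the counts are illustrative assumptions, not an existing OHDSI API or real results.

```python
def operating_characteristics(tp, fp, fn, tn, method):
    """Compute sensitivity, specificity, and PPV from confusion-matrix counts.

    A metric is returned as None when its denominator is zero, rather than guessed.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    return {
        "sensitivity": ratio(tp, tp + fn),  # share of true cases the algorithm found
        "specificity": ratio(tn, tn + fp),  # share of non-cases correctly left out
        "ppv": ratio(tp, tp + fp),          # share of identified persons who are true cases
        "method": method,                   # how the counts were estimated
    }


# Made-up counts for illustration only.
result = operating_characteristics(
    tp=80, fp=20, fn=20, tn=880,
    method="hypothetical chart adjudication of a sample of 1000 records",
)
print(result)
```

Note that chart adjudication of only the persons the algorithm identified gives you `tp` and `fp`, and therefore PPV, but not sensitivity or specificity, which is one reason the method description matters.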
To echo the prior sentiments from @apotvien and @SCYou, our OHDSI community ambition in building an open phenotype library should be to establish and execute best practices for the design, evaluation, and dissemination of phenotypes. It is not realistic to expect that we will develop phenotype algorithms that are perfect in light of the ambiguity of medicine and the incompleteness and inaccuracies in healthcare data. But it does seem very reasonable to expect that we can apply consistent and transparent processes for phenotyping. Rather than waiting for a ‘best practice’ to be finalized and agreed to by everyone, I think we should move forward with a ‘better practice’ and see how far it takes us, recognizing that we’ll have to make some adjustments along the way. To use the analogy of a brick-and-mortar library, we are trying to construct a building that will hold books, while at the same time, trying to define what a book actually is (without having any books currently available to use as a reference). It’s hard to build the bookshelves if you don’t know how tall or wide or heavy a book can be, and it’s even harder to create the Dewey decimal system card catalog without having a collection in place to organize. Our aspirational phenotype entry in our to-be-built library, one which has a complete clinical and logical description, a computable implementation that has been tested across the OHDSI network, with full characterization and comprehensive evaluation across a collection of databases, does not exist, not for one single phenotype.
I think we should start drafting some books: sharing ‘phenotype’ descriptions and cohort definitions and whatever aspects of characterization and evaluation have been completed. The explicit intention is that, by sharing what’s been done, we can rapidly iterate to improve these phenotypes and raise our collective confidence in their use to generate reliable evidence, and also that we can start to build an open collection of whatever we’ve all previously developed as individuals into one shared community resource that we can all benefit from moving forward.