
Phenotype Phebruary 2023 - Week 3 - Debate – customized versus standardized approach for Phenotype development and evaluation

On week 2 of Phenotype Phebruary we had a debate on chart review gold standard validation vs. innovative methods like PheValuator (debate can be found here: Phenotype Phebruary 2023 - Week 2 - Debate - Chart review gold standard validation vs innovative methods like PheValuator)

For week 3 - we are going to have a debate with two sides:

Side 1: Phenotype development should be customized to a prespecified analytical use case (research question) and for a particular data source
Side 2: Phenotype development should model the clinical idea, irrespective of the study (analytical use case) or the data source

Why is this debate important?
Many of our collaborators have already touched on this topic in their posts during Phenotype Phebruary this year as well as last year. As we build our OHDSI phenotype library, addressing this debate is essential because it has implications for whether and how we will be successful. Is the phenotype library a curated set of phenotype definitions that could be used in hypothetical future studies (with documentation of use-case attributes), or is it a curated set of phenotype definitions with acceptable measurement error irrespective of the use case?

C’mon people! Where’s the lively debate? This is a good question and I KNOW there are strong feelings on both sides of the fence…

Throwing out another provocative opening salvo:

Phenotype development is about identifying the set of persons who satisfy one or more criteria for a duration of time. So, for every person, we want to know their health/disease status at every moment in time. The phenotype algorithm is trying to represent that health state, and is subject to measurement errors (sensitivity - incorrectly ‘missing’ persons who have the disease; specificity - incorrectly classifying persons as diseased when they are not; and index date misspecification - getting the timing of the disease duration wrong). The extent of measurement error for a given phenotype algorithm depends on the data source - and we’ve shown that it is often a poor assumption to generalize the error estimate from one source to another. And different researchers may have different evidence needs and analytic use cases, so their tolerance of measurement error may vary.

All that said, we HAVE to work towards phenotype development that is focused on modeling a common and consistent clinical idea (with its fully written clinical description as a starting point). Once we have the algorithm(s) built, we can then evaluate those algorithms across the network. We need objective diagnostics to pre-specify the tolerance for measurement error, and use that decision threshold to select, for analysis, the databases on which the phenotype algorithm has been empirically demonstrated to satisfy it. Absent this approach - if we don’t standardize our algorithms in a network study and instead encourage different questions and sources to all use different phenotype algorithms - we are just forcing unnecessary human-generated variance into our analysis pipeline. We will never know whether the heterogeneity observed between data sources is a result of the phenotype algorithm, the population, or something else.

Now, I’ll offer a bit of a concession, albeit in a theoretical situation that I hear people in our community talk about often but have never seen happen directly myself: it’s possible that, in a network of databases, algorithm A meets the pre-defined measurement error threshold (e.g. sensitivity > 50%, PPV > 50%) on Database 1 but does not meet the rule in Database 2, while algorithm B meets the same decision threshold on Database 2 but not Database 1. In this case, I’d argue there are 4 viable options: a) use algorithm A only on Database 1, with no results from Database 2; b) use algorithm B only on Database 2, with no results from Database 1; c) use algorithm A for Database 1 and algorithm B for Database 2; or d) use algorithms A and B on both Database 1 and 2. In this circumstance, I would be comfortable advocating for (d) - use both algorithms on both databases. This would be a good test of analytic robustness, and if you are properly calibrating your estimates for measurement error, then the two analyses should yield similar results. I really think (c) - use different algorithms on different databases - is quite problematic, since you’re explicitly setting up an uneven comparison. Measurement error calibration could achieve alignment of results, but it’d be very reasonable for a reviewer to wonder ‘what if’ you had applied a consistent definition across your data.
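A minimal sketch of the pre-specified decision rule described above, using made-up operating-characteristic numbers (the thresholds, database names, and estimates are illustrative assumptions, not real validation results):

```python
# Hypothetical operating characteristics for two algorithms evaluated
# on two databases (numbers invented purely for illustration).
estimates = {
    ("A", "db1"): {"sensitivity": 0.72, "ppv": 0.65},
    ("A", "db2"): {"sensitivity": 0.41, "ppv": 0.80},
    ("B", "db1"): {"sensitivity": 0.80, "ppv": 0.35},
    ("B", "db2"): {"sensitivity": 0.58, "ppv": 0.62},
}

def meets_threshold(oc, min_sensitivity=0.5, min_ppv=0.5):
    """Pre-specified decision rule: every operating characteristic
    must clear its threshold for the algorithm/database pair to pass."""
    return oc["sensitivity"] > min_sensitivity and oc["ppv"] > min_ppv

qualifying = {pair for pair, oc in estimates.items() if meets_threshold(oc)}
# With the numbers above, A qualifies only on db1 and B only on db2 --
# exactly the situation where options (a)-(d) diverge.
print(sorted(qualifying))
```

The point of making the rule executable is that the go/no-go decision per database is fixed before any results are seen, rather than negotiated afterwards.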

I’m probably more on side 1 of the argument, but then I often live in an ideal real world where we have access to charts, and in a magical place where we have fully integrated EHR and claims data with extensive longitudinal follow-up. “What if” we all lived in that world! Perhaps I’m overly concerned about the heterogeneity in our data sources, in our research questions, in our CDM(s), in our implementations. That said, I think PheValuator can help rapidly and routinely quality-check these algorithms across space (databases) and time (pre-/post-ICD-9, pre-/post-COVID, etc.)

A few additional questions to ponder.

  1. If my algorithm was validated in the same database I have access to with medical record review but the algorithm was not implemented in the OMOP CDM is it still valid?

  2. What do we do when slight modifications are made to existing phenotypes? Addition of new codes? Porting an ICD-9 era validation to our ICD-10 era?

This is exactly why I call my insurance company and health system daily to reverse-hypochondriac report the absence of a list of diseases. They’ve since stopped taking my calls.

For many conditions this is where the art of medicine meets the science of medicine. “The major objectives of the history are to differentiate Disease A from Disease B, Disease C, or Disease D. All are frequently confused with each other.” So reads many a history-and-physical section of our available clinical descriptions. If the medical community has trouble parsing diseases, surely our data will be challenged, as we have seen time and again in our phenotype deliberations.

The extent of the measurement error may also depend on a specific period of time in a database. So it may be inappropriate to generalize the error estimate from one period of a source to the same source at another time.

  1. Can we port validated algorithms in the same database over time?

I do, though, support the side 2 approach for rapid implementation. Each time before I go to the pharmacy I say “Alexa, what is the risk of Disease X among new users of Drug Z with Indication G… send output to my phone?” She pulls a phenotype library record for Disease X and Indication G, runs a comparator selection on Drug Z, selects negative controls, and runs the analysis, outputting the meta-analysis to my phone. Of course my pharmacist has more than enough time to discuss this output with me at pick-up. My pharmacist loves me, no really, she does.

@Patrick_Ryan , I agree that I would opt for option (d). What I have seen sometimes is that one phenotype is generated for a study. Based on trying it at different sites, you develop code that accommodates particular sites without affecting the others. So it appears to be one phenotype, but there is really something in there equivalent to a case statement selecting what part of the query runs (it can be done less transparently by for example asking what code or data type is available). Option (d) is more transparent.

Jumping on the “provocative” debate stream. (still new to OHDSI forum’s debating guidelines!)

I would like to see if we can start considering the heterogeneity in database types as a factor in selecting side 1 vs. side 2 and even option a-d?

EXAMPLE: Should we really be comparing a (hypothetical) Database 1 that is an outpatient EMR-based data source versus a Database 2 that is an employer-based claims data source in the same network study?

A patient’s journey or clinical pathway/story can have inherently different “narration” based on the database origination principle.

While I agree that,

“extent of measurement error for a given phenotype algorithm is dependent on the data source”

it might also be intriguing if a

“common and consistent clinical idea”

would benefit if we expanded our Phenotype approach to have

  1. EXPOSURE: Fully written clinical description as a starting point
  2. ENDPOINT: Objectives and outcomes needed from the clinical description
  3. PREFERRED: Data source type needed to build this description
  4. LIMITATION: Alternate data source types that can be used and what would be generalized gaps

At least from my perspective, this paints an OHDSI phenotype as a collaborative implementation strategy on using real-world data (knowledge) into scalable and transparent real-world evidence (action).

Assuming we all lived in @Kevin_Haynes’s magical world, where we had everything recorded in the data from the very beginning of life till death, I say side 2 would be the way to go, and we could develop reusable phenotypes purely through modeling a clinical idea. We might not actually need any phenotyping if we ever lived in that world. The truth is in the CONDITION_OCCURRENCE table; we could just use that. Until we get there, I am with @Kevin_Haynes on side 1 of the argument.

The journey from a clinical idea to a fully validated and reusable phenotype is not straightforward. It requires a lot of decisions, each with its own consequences. Actually, it reminds me of the Black Mirror movie: each decision can put you on a different path, and different permutations of the decisions made along the road lead to different versions of a phenotype. I believe that the development and evaluation of reusable phenotypes should be informed by a pre-defined set of inputs coming from the clinical idea and the condition-specific information (onset and duration of the condition, severity, etc.), the use case/flavour, the context of use, and the type of data that will be used for the development, evaluation, and further use of the phenotype. Otherwise, we risk developing rather complicated algorithms with limited use in network studies.

I would like to walk you through my train of thoughts when it comes to developing a phenotype to show you where I am coming from:

Where do I start?

I know the clinical idea, and I have also learned from the phenotype development tutorials that it all begins with 1) identifying the lego bricks (conceptsets) and 2) a building instruction (logic). So how do I identify the legos and come up with the building instruction? I use a cheat sheet, a list of questions that helps me find my way. This list is not exhaustive and definitely does not capture everything, but it is good enough to put me on the right track.

1. Lego selection:

  • What conceptsets to create? For example, is a disease conceptset enough? Or do I need to create conceptsets for symptoms, diagnostic procedures, treatments, differential diagnoses, visit context, or predisposing factors? And of course, the desire to include any of these means a series of follow-up questions to further clarify the path. What symptoms? What treatments? What procedures? And so on.
  • What concepts to include in the conceptset? for example, should I use history of disease, or should I avoid it? How about the complication of the disease? How high should I go up the hierarchy?

2. Building instruction/logic:

  • What is the index event? Disease or symptoms of the disease or the treatments or the diagnostic procedure?
  • What inclusion or exclusion criteria? for example, should I include or exclude differential diagnosis? Do I need a second occurrence of the index event during the follow up? Within what timeframe from the first occurrence do I expect to see the second one? How about a procedure work-up prior to condition? What is the minimum acceptable length of diagnostic workup before diagnosis and re-diagnosis?
  • What is the exit date? For example, for how long after the index date should a person be considered to have the disease?

All of a sudden, I am faced with many different lego options and many more building instructions. All of these choices should be made for a reason: what am I trying to optimize for? Am I aiming for a sensitive definition or a specific one? Do I care about index date misspecification, and to what extent can I live with it? Or do I need a very clear cohort exit date? To come up with the most appropriate combination, I need to know this optimization strategy. The clinical idea alone does not give me enough information. For example, without knowing that the phenotype will be used for identifying patients with incident breast cancer in a set of cancer registries, I would not know whether I should include concepts such as history of breast cancer, create a conceptset of diagnostic procedures for the disease and use it as an inclusion criterion, or require two diagnosis codes of the disease.

This information comes from the use case, the context of use, and the type of data being used for the development, evaluation, and use of the phenotypes. Without it, I am left with many possibilities and little idea of when I am done and what is good enough. This impacts the efficiency and transparency of the process and the further reusability of the phenotype I am creating. Likewise, as a future user of specific phenotypes, I will face a huge degree of uncertainty when it comes to selecting the right phenotype and assessing its utility in correctly identifying the clinical idea in my data source(s).

That being said, I agree with @Patrick_Ryan that option (c), using different algorithms on different databases to answer a question, is problematic. I think we can develop phenotypes using a systematic and structured approach with a clear description of what they are trying to capture based on all these inputs. At study time, researchers can use this information to assess whether the phenotype can be used in their databases and choose the appropriate databases to include in the study. This way, the reason for excluding certain databases can be reported clearly and any tweaks to the existing phenotype can be identified easily.


Well - what is ‘phenotyping’? I think about it as a ‘model’ of something real. Here, ‘real’ is the period of time a person truly experiences a condition or treatment.

What is a model?

A model is an informative representation of an object, person or system.

A conceptual model is a theoretical representation of a system. It consists of concepts used to help people know, understand, or simulate a subject the model represents.

Mathematical models are used in the natural sciences (such as physics, biology, earth science, chemistry) and engineering disciplines (such as computer science, electrical engineering).

We can think about the ‘clinical description’ as the Conceptual model describing the target. It is for investigators/people to agree on the target.
We can think of the ‘Cohort Definition’ as the mathematical model - this is for the computer and contains a set of specified parameters (e.g. rule based or probabilistic based).

If we agree on the above framing, then I think we can extend this discussion as follows:

We construct mathematical models: Mathematical models may be constructed to maximize one or more operating characteristics. These ‘operating characteristics’ have been called ‘flavors’ by others in the community (@Christian_Reich @Daniel_Prieto).

Model construction is inevitable: In our field, the parameters for maximizing these operating characteristics are person-level events represented in vocabulary concepts (time-stamped person-level observations). These are imprecise and incomplete (as @Kevin_Haynes argues), and thus, as @agolozar said, we cannot just use the ‘condition_occurrence’ table verbatim. This fact will persist as long as we are performing secondary research on observational data.

Model construction has humans in the loop: This is a source of investigator bias. Humans are reductionists and follow the principles of parsimony. We select parameters based on intuition, experience, expert opinion, or convenience. Good science requires reducing such bias.

Models should be evaluated:

A crucial part of the modeling process is the evaluation of whether or not a given mathematical model describes a system accurately.

  • Operating characteristics: In our field these are sensitivity, specificity, positive predictive value, negative predictive value, and the new OHDSI one, index date misspecification. These operating characteristics are attributes of a cohort definition * data source pair, i.e. the same cohort definition may have different operating characteristics when applied to different data sources.
  • Scope of the model and fit to a data source: A cohort definition may have been built and tested on an electronic health record data source that has lab results. This cohort definition has a narrower scope as it requires lab results. Lab results are not present in billing data sources.
  • Peer review: agreement/disagreement among peers may help improve model evaluation.

That being said - where do I stand in this debate? I believe the sequence is

Research question → define clinical idea → phenotype development and evaluation.

But that does not mean I am on side 1 :slight_smile: Because, if the above sequence is approximately correct, then the path to phenotype development and evaluation is

define clinical idea → phenotype development and evaluation

i.e. it may be considered a sequence of linear steps.

Side 1 is arguing for Research question → define clinical idea → phenotype development and evaluation
Side 2 is arguing for define clinical idea → phenotype development and evaluation

But my position is: both sides need to define the clinical idea. This is the conceptual model - a theoretical representation of a system. Phenotype development and evaluation is the process of mathematically modeling and evaluating this clinical conceptual model. If the library is able to clearly say that we have 10 cohort definitions (mathematical models) for the same clinical description (conceptual model), and a researcher needs to study that conceptual model (i.e. they accept the clinical description), then the role of the researcher is reduced to this: given 10 cohort definitions with various operating characteristics reported on a set of data sources, pick and choose the operating characteristic that is best suited for their study.

e.g. if you are doing disease surveillance and would like a sensitive operating characteristic, sort by sensitivity. If you are doing a comparative study and want the phenotype to be very specific, sort by specificity. If you are doing a time-to-event analysis and want a precise index date, sort by index date misspecification.
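That selection step can be sketched in a few lines; the library records, field names, and numbers below are hypothetical, purely to illustrate “sort by the operating characteristic your use case needs”:

```python
# Hypothetical phenotype-library records: several cohort definitions
# (mathematical models) for the same clinical description, each with
# operating characteristics reported on a given data source.
library = [
    {"definition": "def_01", "sensitivity": 0.91, "specificity": 0.70, "index_date_error_days": 12},
    {"definition": "def_02", "sensitivity": 0.55, "specificity": 0.97, "index_date_error_days": 3},
    {"definition": "def_03", "sensitivity": 0.74, "specificity": 0.88, "index_date_error_days": 6},
]

def pick(records, use_case):
    """Reduce the researcher's task to sorting by the operating
    characteristic their analytic use case cares about most."""
    key = {
        "surveillance": lambda r: -r["sensitivity"],            # favour sensitive
        "comparative": lambda r: -r["specificity"],             # favour specific
        "time_to_event": lambda r: r["index_date_error_days"],  # favour precise index
    }[use_case]
    return sorted(records, key=key)[0]["definition"]
```

With these toy records, `pick(library, "surveillance")` returns the most sensitive definition, while a comparative study or time-to-event analysis would select a different one: the clinical description stays fixed, only the selection criterion changes.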

Thanks @Gowtham_Rao – I think this conceptualization really advances the discussion. And thanks to all the others on here for a great thread.

  1. A further consideration is whether we presume all research needs are best served by federated network research (and a preference for a generalizable cohort definition). Many organizational use cases may instead be driven by a requirement to maximize performance in their data and their data alone. (e.g. in patient level prediction, proving out generalizability is great, and important, but many models may be driven by local needs with performance proven out in local validation cycles). That notwithstanding, OMOP and OHDSI tools may be a valued part of that engine, and they may drive value from a phenotype library that gets their process started with a phenotype definition best suited to their use case.
  2. I think both interests are served by getting a better sense of how the different components (@agolozar’s Legos) and combinatorics of a definition (i.e. variations-of-interest in concept sets, and combinations-of-interest between them) perform in a network of databases.

(2) is not straightforward to do. The combinatorics quickly explode, and even with simplifying assumptions, it’s not reasonably tractable (or configurable) without a computational approach.
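To make the explosion concrete, here is a toy count of candidate definitions from a handful of hypothetical component choices (the knobs and option names are invented for illustration, not drawn from any real definition):

```python
from itertools import product

# Hypothetical component choices for a single cohort definition:
# a few concept-set variants and logic options of the kind discussed
# in the "lego selection" and "building instruction" questions above.
choices = {
    "diagnosis_concept_set": ["narrow", "broad", "broad_plus_history"],
    "index_event": ["diagnosis", "symptom", "treatment"],
    "second_occurrence": ["not_required", "within_30d", "within_90d", "within_365d"],
    "diagnostic_workup": ["not_required", "required"],
    "exit_rule": ["fixed_365d", "era_based"],
}

# Cartesian product of the options: every combination is a distinct
# candidate definition that would need evaluation on every database.
variants = list(product(*choices.values()))
print(len(variants))  # 3 * 3 * 4 * 2 * 2 = 144 candidates from five knobs
```

And 144 candidates is before multiplying by the number of databases in the network, which is why a computational framework (rather than manual review) seems unavoidable here.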

The tools to do so are likely there in CirceR, Phea, and others, although we’d need to think more about a framework, and a lot of validation-related compute…

But what could get interesting is this: could a pre-specified error tolerance inform not a ‘go / no-go’ decision on a cohort definition / database pair, but rather which combination of components configures a definition in a database and permits its entry in a study?

I’d argue this increases our N and our geographic diversity in the network study while minimizing our measurement error… wouldn’t that lead to more reliable evidence?

A core question becomes,

What if it’s machine-generated variance instead?

(I’m sure week 4 still has some interesting discussion to come, on that theme :slight_smile: )