Wonderful job, and it was great practice for me going through the Cohort Diagnostics. My one question: if this is incident diabetes, do codes for late complications of diabetes imply that this must be a prevalent case, not an incident one? It is incident in the sense of being newly recognized by that health system, but likely prevalent for the patient.
Great question @hripcsa ! It’s interesting to think about a code of ‘complication’ and the notion of incident vs. prevalent disease status. Here, we’ve applied a 365d prior observation window requirement. We could consider the question, ‘what if we required more prior observation time?’ - that may potentially clean out these prevalent cases who just hadn’t had health service utilization over the original interval. So, for example, count as incident only those cases with 730d or 1095d of prior observation, recognizing that will likely impact our sensitivity due to incomplete follow-up of our population. But even if we had complete follow-up, I still suspect we’d see many ‘incident recognized’ patients who present for the first time ever with a diabetic complication (because the diabetes was asymptomatic and undetected previously). This probably goes to the semantics of what we mean by ‘new’. ‘Newly recognized’ may be a better moniker than ‘newly diseased’.
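To make that concrete, here’s a minimal sketch in R on a purely hypothetical extract (not a real CDM query or an ATLAS definition), just showing how the ‘incident’ cohort shrinks as the required prior observation window grows:

```r
# hypothetical extract: each person's first T2DM diagnosis date and the
# start of their observation period (toy data, not from any real source)
first_dx <- data.frame(
  person_id        = 1:6,
  first_t2dm_date  = as.Date(c("2019-03-01", "2018-06-15", "2020-01-10",
                               "2017-09-30", "2019-11-20", "2021-02-05")),
  obs_period_start = as.Date(c("2018-01-01", "2015-05-01", "2019-10-01",
                               "2014-01-01", "2018-12-01", "2018-07-01"))
)

for (window_days in c(365, 730, 1095)) {
  n <- sum(as.integer(first_dx$first_t2dm_date - first_dx$obs_period_start) >= window_days)
  cat(sprintf("prior observation >= %4d days: %d persons remain 'incident'\n", window_days, n))
}
```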
What do others think?
First - this was a phenotype masterclass. Thanks.
Second - agree that index date mis-specification is a major problem in phenotyping work, and it is detectable at different timescales. Here we see it on an ‘outpatient’ time scale; it becomes even harder to handle on an inpatient time scale. When does pneumonia get diagnosed? At the timestamp of the diagnosis code, the timestamp of the chest x-ray, the timestamp when they hit the door for the encounter - and are any of these even resolvable in the data? It has an impact on, among other things, the predictive models we create for these outcomes. In the diabetes case, any predictive algorithm developed using the C1 definition will almost certainly converge on having high A1c and metformin use as predictive features, and it will probably be good at predicting that a diagnosis code has been ‘left behind’ in prevalent cases, but perhaps at the cost of doing a good job of predicting incident cases (which may be why it’s being developed in the first place).
Third - really like the structured walk through CohortDiagnostics. Population level validation is an enormous advance. In a field accustomed to chart review for validation exercises, it may benefit us to think about what a complete structured tour of CohortDiagnostics looks like - i.e., if you were to run a Delphi round using this tool for phenotype x, what are the steps? Are there different classes of x for which those steps should be different?
Thanks @Evan_Minty . You raise a good point about index date(time) misspecification for inpatient events. Very often a dataset doesn’t offer much fidelity here; for example, administrative claims often only provide discharge diagnoses, so you may know that something happened before or during the admission, but you couldn’t pinpoint exactly when. Adding date + time to the CDM was a specific ask from folks who anticipated doing lots of research at the more granular level inside of a hospital, but I haven’t myself seen any data partners with the timestamped data conducting analyses and examining index date misspecification at that level (though, as you highlight, it is almost assuredly there).
To your point on predictive modeling, this is a topic that @jennareps and I have often discussed (including as recently as earlier this week): if you run a prediction model and it gets you a really good AUC, then instead of being excited, you probably ought to be worried, because it could mean that you’ve got some index date misspecification that’s causing your ‘predictors’ to simply be the early indicators of the outcome. Cleaning the target cohort of all of these items is important to get an honest performance estimate of predicting future outcomes that are truly new. I think this is a critical aspect of phenotyping that is often overlooked, or maybe just not thought about in the context of ‘a phenotype problem’.
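As a rough illustration of that ‘cleaning out’ (a toy sketch on hypothetical data, not how PatientLevelPrediction implements it), one option is to drop candidate predictors that fall inside a short blackout window right before the index date, so early markers of the outcome can’t masquerade as predictors:

```r
library(dplyr)

# hypothetical toy data: a target cohort and candidate predictor records
target_cohort <- tibble::tibble(
  person_id  = c(1, 2),
  index_date = as.Date(c("2020-06-01", "2020-07-15"))
)
covariates <- tibble::tibble(
  person_id      = c(1, 1, 2),
  covariate_id   = c("hba1c_elevated", "metformin", "hba1c_elevated"),
  covariate_date = as.Date(c("2020-05-20", "2019-01-10", "2019-12-01"))
)

blackout_days <- 30  # hypothetical washout window immediately before index

# keep only predictors recorded before the blackout window
covariates_clean <- covariates %>%
  inner_join(target_cohort, by = "person_id") %>%
  filter(covariate_date < index_date - blackout_days)
```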
I 100% agree that developing shared best practices for how to use CohortDiagnostics would be a nice effort to build out across the OHDSI community. Does anyone have any thoughts about what those best practices may be?
This is great! Such an impressive body of work from lab data cleaning to drug coding to complications.
We should talk this month about the gold standard validation efforts to really dig deep into these phenotypes. For sure, significant database heterogeneity may be explained by how the data were collected, and we should keep those closest to the data provenance close to our research teams. For example, a single diagnosis in an EHR (looking at you, CPRD) vs. a diagnosis routinely billed during follow-up care. We will also need to be mindful of a single T1DM code that was a miscode or an early rule-out, and its implications for identifying populations of interest.
Of course, as gold standards go, I have a probe in my pancreas monitoring beta cell function, daily updating my administrative claims data provider with HOMA-IR values of insulin resistance (something I signed up for to get cash in my health savings account). So I think there will always be the patient who truly shows up with prevalent undiagnosed T2DM and on presentation has complications. Surely the T2DM is prevalent, but it might not even have been known to the patient, let alone that health system. Think of people who show up to the ER and deliver a baby without knowing they are pregnant. Agree this nuance should be rare. Always worth thinking through how the perfect (probe in the pancreas) can become the enemy of the good (restrictive definitions to eliminate edge cases).
This should be a fun month as we think through chronic vs. acute events, chronic events that relapse and remit, etc. As a pharmacist, the only information we often have at the point of care is ‘tell me what drugs you are on and I will tell you what’s wrong with you.’ So this phenotype work has far-reaching implications even beyond our research needs as we work to identify populations for quality improvement initiatives and early interventions.
Thanks @Kevin_Haynes . To your point about one T1DM code possibly being a miscode and its implications for the population of interest, check out this: Phenotype Phebruary Day 2 - Type 1 diabetes mellitus
Lots more fun work ahead of us!
I missed the intro to Phenotype February, so sorry if this is explained elsewhere, but is there a way to get an ATLAS login to the atlas-phenotype.ohdsi.org instance so I can view the phenotype definition and concept sets?
Is this phenotype definition affected by https://github.com/OHDSI/Vocabulary-v5.0/issues/463?
Hi @Jake, this is the form @Gowtham_Rao shared to get an ATLAS login: OHDSI Atlas Phenotype Library Registration
Another diagnostic we can use for determining the performance characteristics of our algorithms is PheValuator. Here are the results I found when running PheValuator on two datasets:
From this analysis, we see that the changes in the algorithms produced little change in the positive predictive value (PPV), at the expense of lowering the sensitivity. The PPV for each of the algorithms was very good, at or above 85%. The low sensitivities were likely due to these incident algorithms missing a significant portion of cases. This is particularly evident in the Medicare data, where most cases of diabetes are likely prevalent from the start of the person’s time in the health plan.
This is great, thank you @jswerdel ! For those who haven’t yet played with PheValuator, here’s Joel’s initial paper on it in JBI, and here’s the codebase. He’s also presented several enhancements and benchmark studies at the last couple of OHDSI Symposia, and that work is worth checking out.
It is tremendously valuable to get estimates of sensitivity, specificity, and positive predictive value for a given phenotype algorithm. Often with traditional chart review, we only get an estimate of positive predictive value, and if that’s all you’ve got, you can’t actually do any correction for measurement error.
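To illustrate why that matters, here’s a minimal sketch (hypothetical numbers, not the PheValuator results above) of the Rogan-Gladen correction, which needs sensitivity and specificity rather than PPV alone:

```r
# Rogan-Gladen correction: recover a "true" prevalence from an observed
# (apparent) prevalence, given the algorithm's sensitivity and specificity
rogan_gladen <- function(apparent_prev, sens, spec) {
  (apparent_prev + spec - 1) / (sens + spec - 1)
}

# hypothetical numbers, not taken from the results above
rogan_gladen(apparent_prev = 0.08, sens = 0.45, spec = 0.99)  # ~0.16
```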
For T2DM, it’s really interesting to see that the PPV is high (>84% for all three algorithms in both databases), but the sensitivity is more modest. This suggests that our algorithms, all of which require a diagnosis code, are missing over half the cases of T2DM. These databases don’t provide complete lab measurements, but you could imagine increasing sensitivity by creating a definition which allows persons with a diabetes drug (even without diagnosis) or persons meeting the ADA diabetes diagnostic criteria based on glucose or HbA1c values. Of course, any approach to increasing sensitivity needs to come with a proper assessment of the impact on specificity (and PPV).
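Purely as an illustrative sketch (hypothetical toy data, not an ATLAS cohort definition), a lab-based rule in the spirit of the ADA thresholds might look like this:

```r
library(dplyr)

# hypothetical measurement extract: one row per lab result
measurements <- tibble::tibble(
  person_id = c(1, 1, 2, 3),
  lab       = c("hba1c", "fasting_glucose", "fasting_glucose", "hba1c"),
  value     = c(7.1, 118, 131, 5.6),
  unit      = c("%", "mg/dL", "mg/dL", "%")
)

# flag persons meeting an ADA-style threshold: HbA1c >= 6.5% or
# fasting plasma glucose >= 126 mg/dL, with or without a diagnosis code
lab_based_t2dm <- measurements %>%
  filter((lab == "hba1c" & value >= 6.5) |
         (lab == "fasting_glucose" & value >= 126)) %>%
  distinct(person_id)
```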
@jswerdel , could you discuss briefly how you parameterized PheValuator to obtain these results?
@Patrick_Ryan this rule of requiring 365 days of prior observation makes intuitive sense. From my clinical experience, adults who feel apparently healthy don’t find a reason to seek care. Type 2 Diabetes Mellitus is indolent - i.e., the person feels “apparently” healthy - and probably has the disease without knowing it for a long time. Conversely, once a person receives the “label” of Type 2 Diabetes Mellitus, they are more likely to seek follow-up care.
This 365 days of prior observation time tries to tease out the people who are being observed in the database for the first time for initial diagnostic care for Type 2 Diabetes Mellitus vs. those who are receiving follow-up management care for Type 2 Diabetes Mellitus.
But why 365 days - why not 1,000 days or 100 days? Is this an OHDSI best practice, or is this a research opportunity?
I think it would be wonderful if Cohort Diagnostics could tell us the characteristics of the people that are in C3 but not in C2/C1. What are the attributes of the people we are losing? Are the population-level characteristics of the people we are losing similar to or different from C3?
i.e., are the people we are removing less likely to represent the type of clinical profile described by the clinical description from the American Diabetes Association? If yes, that’s perfect!
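Outside of Cohort Diagnostics itself, here is a rough sketch (on hypothetical toy data) of the comparison I have in mind, using standardized mean differences between the people in C3 who drop out of C1 and those who remain:

```r
library(dplyr)

# hypothetical person-level binary covariates for cohort C3, flagged by
# whether the person also qualifies for the stricter C1 definition
c3 <- tibble::tibble(
  person_id  = 1:8,
  in_c1      = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE),
  neuropathy = c(0, 0, 1, 1, 0, 1, 0, 0),
  metformin  = c(1, 1, 0, 1, 1, 0, 1, 1)
)

# standardized mean difference for binary covariates
smd_binary <- function(p1, p2) (p1 - p2) / sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / 2)

# compare the people dropped from C1 (p1) against those kept in C1 (p2)
c3 %>%
  summarise(across(c(neuropathy, metformin),
                   ~ smd_binary(mean(.x[!in_c1]), mean(.x[in_c1]))))
```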
@Patrick_Ryan @hripcsa I think the clinical description should be clarified to make progress on this topic. One definition of incident is the first time the doctor or the patient learned that the person has the phenotype. Another definition is the first time the person biologically had the phenotype.
If we are referring to the first time it was learned that the person has the phenotype, then the 365 days rule would be appropriate.
If we are referring to the first time the person biologically had the phenotype, then the clinical description could clarify that if the person also has indicators of disease chronicity such as diabetic foot, paresthesia, ulcers, etc., that is not new-onset diabetes. In this case, we will need to add additional inclusion rules to the cohort definition - so as to exclude people with such indicators of chronic disease between day 0 and up to some future day as allowed by the clinical description (e.g., diabetic foot would not be expected within 1 year of biological disease onset).
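A rough sketch of such a rule on hypothetical toy data (the 365-day horizon is just a placeholder for whatever the clinical description justifies):

```r
library(dplyr)

# hypothetical toy cohort and chronicity-indicator events
cohort <- tibble::tibble(
  person_id  = c(1, 2, 3),
  index_date = as.Date(c("2020-01-01", "2020-02-01", "2020-03-01"))
)
chronicity_events <- tibble::tibble(
  person_id  = 2,
  event_date = as.Date("2020-06-15")  # e.g. a diabetic foot code
)

horizon_days <- 365  # placeholder for a clinically justified horizon

# anyone with a chronicity indicator between day 0 and the horizon is excluded
excluded <- chronicity_events %>%
  inner_join(cohort, by = "person_id") %>%
  filter(event_date >= index_date,
         event_date <= index_date + horizon_days) %>%
  distinct(person_id)

# persons kept as plausibly 'biologically incident'
incident_cohort <- anti_join(cohort, excluded, by = "person_id")
```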
The clinical description is not asking for biological incidence - as written
@Gowtham_Rao , important point. I probably should have stated it more generically as:
“we add an inclusion criteria that requires some period of prior observation, with the intent to give confidence that the event is new because it hadn’t been previously observed for that prior observation duration”
365 days is purely a convenient heuristic commonly used, but it has no real empirical basis and is likely highly inappropriate in many circumstances: it is probably too short if it would be reasonable to expect a person wouldn’t go back to seek follow-up care every year because the condition can be managed effectively by the patient (mild arthritis and osteoporosis come to mind), and it is probably too long if it would be reasonable to expect very regular care (like end-stage renal disease, where you’d expect to see monthly dialysis). I’ll also note that the COVID pandemic totally screwed with lots of regular preventative services/well visits, so the gaps between normal care could be longer in 2021-2022 than what we may have seen previously. (Anyone in the community have a database of dental visits to plot this out? )
The decision here effectively amounts to a bias/variance tradeoff. A shorter ‘prior observation’ value will increase the chance that you are pulling ‘prevalent’ cases into your ‘incident’ case definition, which means you’ll have greater index date misspecification. But you’ll also have a larger sample size, which will increase your statistical power for whatever question you are trying to answer (which is the argument I most often hear when people like to diddle around with this number from 365d to 180d). A longer prior observation window will increase your confidence that cases are truly incident, but then you may actually be excluding some ‘true incident’ cases simply because they don’t have enough historical data.
I do think some empirical investigation could be done into the impact of this design choice. Off the top of my head, I think you’d probably start by creating some evaluation set of persons who did have some extended observation period time (like, for argument’s sake, 10 years). Then you’d apply phenotypes with different prior observation period lengths within that subset, and you’d be able to compare the resulting patient sets. I don’t have any intuition for how big an issue this actually is, but my initial gut is that 365d could be a bit low for what we may want when services involve annual check-ups and patients may miss or delay a follow-up visit.
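To sketch that out on hypothetical toy data (not a real evaluation): restrict to persons with at least ten years of total observation, then look at who enters the 365d ‘incident’ cohort but falls out at 1095d - the candidates for index date misspecification:

```r
library(dplyr)

# hypothetical toy data: observation period bounds and first T2DM diagnosis
persons <- tibble::tibble(
  person_id       = 1:5,
  obs_start       = as.Date(c("2005-01-01", "2006-03-01", "2008-07-01",
                              "2004-05-01", "2009-01-01")),
  obs_end         = as.Date(c("2021-12-31", "2021-12-31", "2019-06-30",
                              "2021-12-31", "2021-12-31")),
  first_t2dm_date = as.Date(c("2006-02-01", "2010-01-15", "2009-03-01",
                              "2015-08-20", "2010-06-01"))
)

eval_set <- persons %>%
  filter(as.integer(obs_end - obs_start) >= 3650) %>%   # ~10 years observed
  mutate(prior_obs = as.integer(first_t2dm_date - obs_start))

# in the 365d cohort but not the 1095d cohort: most at risk of being
# prevalent cases mislabeled as incident
eval_set %>% filter(prior_obs >= 365, prior_obs < 1095)
```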
I agree - this is an opportunity to test the impact. I will try to revisit this after Phenotype Phebruary through another OHDSI workgroup.
An old Jim Lewis paper regarding identification of incident events that is worthy of review/replication in today’s data: The relationship between time since registration and measured incidence rates in the General Practice Research Database - PubMed
Another consideration is that the requirement of having X days of observation before the diagnosis forces us to drop persons with a shorter observation period before their diagnosis. This can be a problem in US data, where follow-up is truncated when persons change jobs or health plans. Palmsten et al. showed this clearly in the context of drug safety in pregnancy using Medicaid (the observed effect is possibly more extreme than in other data sources); see Figure 4 in Harnessing the Medicaid Analytic eXtract (MAX) to Evaluate Medications in Pregnancy: Design Considerations
As I was preparing to discuss the parameters for running PheValuator for this analysis, I was concerned about the low sensitivities and wondered if this could be explained. I experimented with different parameters for the evaluation cohort (to be explained below) and got higher sensitivities:
while the PPVs remained about the same. With that in mind, let me briefly review some of the parameters (for the full explanation, please see the vignette). The process has changed significantly since V1. In the latest version, we use visit-level analyses for estimating the performance characteristics, as compared to using all the data in the subject’s record in V1. Details of the xSpec and xSens cohorts are in the vignette and are a bit too lengthy to discuss here. The changes I made were in the evaluation cohort, the cohort with a large random set of subjects either with the condition of interest or without it. In the original analysis, I created an evaluation cohort where a random visit in the subject’s record was selected for analysis, including for those with T2DM. However, the algorithms we wanted to test were for the earliest recorded diagnosis of T2DM. I changed the evaluation cohort to only include the earliest visit for those with T2DM. Using this approach increased the sensitivity as shown; subjects were now being matched better on a visit-by-visit basis. We had observed lower sensitivities in our PheValuator Benchmark comparisons, and this may help to explain that finding.
One other interesting finding: when I changed the first algorithm to not include the requirement for a 365-day lookback (the fourth algorithm in the list), I found a higher PPV compared to the original. Subjects in the prevalent algorithm, on average, may have a higher probability of being a case compared to newly diagnosed subjects. These subjects are more likely to be well into their treatment, so the diagnostic predictive model used to evaluate the subjects has more evidence to support the diagnosis and estimates a higher probability of the condition.
@Gowtham_Rao Hi, sorry for joining the discussion late. Not sure whether the registration deadline has already passed. I filled out the form 2 days ago but didn’t receive any email to guide me on how to create an account to access atlas-phenotype.ohdsi.org. Please let me know if there is something I am missing here.
@Helen , you are definitely not too late (though I do appreciate that you’ve used a sloth as your icon for this post ). Phenotype Phebruary isn’t stopping; we’re just using it to get the conversation started. You can register for access to atlas-phenotype by filling in this form: OHDSI Atlas Phenotype Library Registration . And also, I encourage you to join the Phenotype Development/Evaluation WG, which is run from within the OHDSI MS Teams environment; you can sign up for this and any other WG you’d like here.