Has PheValuator been run for cancer phenotypes yet?

Daniel_Smith · October 23, 2020, 12:15pm

Hi everyone,

Per the topic. Specific call out to @rimma, @mgurley, @Christian_Reich, and @Shilpa_Ratwani (or @shilparatwani?).

We’re defining our cohort for inclusion at Emory Winship Cancer Institute from the broader data warehouse, and we have a local phenotype definition that has been in use for some time. There’s also the NCI Casefinding list, which overlaps with our phenotype somewhat, but we don’t include all of its supplemental codes (https://seer.cancer.gov/tools/casefinding/).

However, I’d like to take what I’ve learned at OHDSI 2020 about the Phenotype Library and apply it locally to Winship data by using the “Cancer” phenotype (phenotype ID 443392000). I’m currently unsure how that definition holds up under PheValuator. Since we’re not bringing in all of our data warehouse at this time and are focusing only on cancer patients, we unfortunately won’t be able to run PheValuator locally, and our data import depends on the phenotype we end up using.

As an aside, it was wonderful to see everyone virtually at OHDSI 2020! Was a great conference and a lot was learned.

ericaVoss · October 23, 2020, 12:26pm

Also tagging @jswerdel

Christian_Reich · October 25, 2020, 5:31pm

Yeah. That’s a catch-22, then. You can only second-guess your inclusion criteria using PheValuator if you give it a chance to learn how to distinguish cancer from non-cancer. If all your patients are cancer patients, there is not much to do.

Why are you not bringing in the entire warehouse? The additional burden is small, but the value is much larger.
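To make the catch-22 concrete, here is a minimal sketch of the usual performance arithmetic for a phenotype definition. It is not PheValuator's API; the function names and all counts are hypothetical, and the cancer-only example assumes every imported patient truly has cancer. With no non-cancer patients there are no true negatives or false positives, so specificity has a zero denominator and cannot be estimated.

```python
from typing import Dict, Optional


def rate(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def phenotype_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, Optional[float]]:
    """Standard 2x2 performance measures; None marks an undefined measure."""
    return {
        "sensitivity": rate(tp, tp + fn),
        "specificity": rate(tn, tn + fp),
        "ppv": rate(tp, tp + fp),
        "npv": rate(tn, tn + fn),
    }


# Hypothetical counts. Cancer-only extract: no non-cancer patients at all.
cancer_only = phenotype_metrics(tp=900, fp=0, fn=100, tn=0)

# Hypothetical counts. Full warehouse: non-cancer patients are present.
full_warehouse = phenotype_metrics(tp=900, fp=50, fn=100, tn=8950)

print(cancer_only["specificity"])     # undefined (None) in the cancer-only extract
print(full_warehouse["specificity"])  # estimable once non-cases are loaded
```

The point of the sketch is only that the non-case rows have to exist before specificity (and a meaningful PPV or NPV) can be computed at all.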

Daniel_Smith · August 25, 2021, 12:18pm

Hi @Christian_Reich,

That was a good question, and it took a little while to answer. The summary is that, as we are initiating this solely within Winship Cancer Institute, we focused on pulling in, cleaning, and making data available for our primary stakeholders: oncologists. That said, much of what I’ve seen regarding the advantages of characterizing and then evaluating phenotypes with tools such as PheValuator does indeed encourage us to ingest all of our data across our healthcare network, rather than just focus on cancer patients. Thanks as always for asking important questions that stir the search for valuable answers! It will likely be time to switch off our filter soon and ingest the gamut of patients.

Christian_Reich · August 26, 2021, 12:17am

Let us know if you need help. Particularly with the Oncology Extension.

Daniel_Smith · August 27, 2021, 1:20pm

Will do, and I’ll try to jump on the recent oncology WG calls so I can get updates and help there too. We’re about to start ingesting NAACCR data, but have to develop a good patient match algorithm first. Off topic for this post, but @Christian_Reich, if you’ve come across good literature and options for patient match algorithms in your travels, I’d love to take a look.

Christian_Reich · August 27, 2021, 1:31pm

Depends on what you have, and whether your identifiable information is clean or dirty. And whether you have a PHI protection problem. The simple matching algorithms just take what you have, match, and toss out the result if it is ambiguous (e.g. two patients with the same name “Daniel Smith” but different SSNs in one source, and “Daniel Smith” without an SSN in the other). There are approaches that use probabilities of all kinds (like how unique “Daniel” and “Smith” are). And then there are those that solve the technical problem when neither source is allowed to see the other source or the combined asset.

Which one do you need?
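The "simple" approach described above (match on what you have, discard anything ambiguous) can be sketched roughly as follows. This is a minimal illustration, not a recommended production matcher: the record layout (`id`, `first`, `last`, `dob`), the choice of name plus date of birth as the key, and the sample records are all assumptions made for the example.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def match_key(rec: dict) -> Tuple[str, str, str]:
    """Normalized key built from the fields both sources share (assumed here)."""
    return (rec["first"].strip().lower(), rec["last"].strip().lower(), rec["dob"])


def simple_match(source_a: List[dict], source_b: List[dict]) -> Dict[str, str]:
    """Return {id_in_a: id_in_b} for keys that are unique on both sides."""
    index_b = defaultdict(list)
    for rec in source_b:
        index_b[match_key(rec)].append(rec)
    counts_a = Counter(match_key(rec) for rec in source_a)

    matches = {}
    for rec in source_a:
        key = match_key(rec)
        candidates = index_b.get(key, [])
        # Keep the pair only when exactly one record on each side has this key;
        # anything else is ambiguous and is tossed out.
        if counts_a[key] == 1 and len(candidates) == 1:
            matches[rec["id"]] = candidates[0]["id"]
    return matches


# Hypothetical data mirroring the example: two "Daniel Smith" records with
# different SSNs in one source, one "Daniel Smith" without an SSN in the other.
registry = [
    {"id": "A1", "first": "Daniel", "last": "Smith", "dob": "1980-01-01", "ssn": "111"},
    {"id": "A2", "first": "Daniel", "last": "Smith", "dob": "1980-01-01", "ssn": "222"},
    {"id": "A3", "first": "Jane", "last": "Doe", "dob": "1975-05-05", "ssn": "333"},
]
ehr = [
    {"id": "B1", "first": "daniel", "last": "smith", "dob": "1980-01-01", "ssn": None},
    {"id": "B2", "first": "Jane", "last": "Doe", "dob": "1975-05-05", "ssn": None},
]

matches = simple_match(registry, ehr)
print(matches)
```

Here only the Jane Doe pair survives; both Daniel Smith records are dropped because the shared fields cannot tell them apart.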

Daniel_Smith · August 27, 2021, 1:42pm

Clean for some sources (e.g., the MRN type is provided) and dirty for many others (e.g., we can’t tell whether a number is an MRN of a specific type, since our MRNs differ by intake site, an EMPI, or some other identifier).

No PHI protection problem across sources, for the most part; just sometimes poorly abstracted PHI.

We developed something like that a while back, with a score assigned based on which fields were used to create the match. Things like MRN, EMPI, last name, DOB, and SSN were weighted higher in scoring than the combination of first name + last name, or patient address.
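A field-weighted score of the kind described above might look roughly like this. The weights, the threshold, and the field names are invented for illustration and are not the values used at Winship; only the rough ordering (strong identifiers counting for more than name or address evidence) follows the description.

```python
from typing import Optional

# Hypothetical weights: strong identifiers count more than weaker evidence.
WEIGHTS = {
    "mrn": 5,
    "empi": 5,
    "ssn": 5,
    "dob": 3,
    "last_name": 3,
    "first_name": 1,
    "address": 1,
}
THRESHOLD = 8  # hypothetical cut-off for accepting a match


def _norm(value) -> Optional[str]:
    """Trim and lowercase a value; treat None and empty strings as missing."""
    if value is None or value == "":
        return None
    return str(value).strip().lower()


def match_score(a: dict, b: dict) -> int:
    """Sum the weights of fields that are present in both records and agree."""
    score = 0
    for field, weight in WEIGHTS.items():
        va, vb = _norm(a.get(field)), _norm(b.get(field))
        if va is not None and va == vb:
            score += weight
    return score


def is_match(a: dict, b: dict) -> bool:
    return match_score(a, b) >= THRESHOLD


# Hypothetical records
rec_a = {"mrn": "12345", "last_name": "Smith", "first_name": "Daniel", "dob": "1980-01-01"}
rec_b = {"mrn": None, "last_name": "smith", "first_name": "Daniel", "dob": "1980-01-01"}
rec_c = {"mrn": "12345", "last_name": "Smith", "first_name": "Dan", "dob": "1980-01-01"}

print(match_score(rec_a, rec_b), is_match(rec_a, rec_b))
print(match_score(rec_a, rec_c), is_match(rec_a, rec_c))
```

In this sketch a missing field contributes nothing rather than counting against the pair, which is one possible design choice; a stricter variant could subtract points when both values are present but disagree.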

The case where the sources must not see each other won’t apply to us in the majority of cases, and thankfully I’m not aware of any of our current sources to which it would apply. I’m sure we’ll encounter this in the future, but we can pass on this aspect for now.

We’re looking for the simple matchers, just to ensure that we’re not missing anything, and would likely also be intrigued by the probabilistic matchers for when the simple matching fails.