PheValuator - FAQs

Akshay · February 28, 2020, 7:21am

Hello Everyone,

I have recently started trying out OHDSI tools and came across the PheValuator R package.
I read the paper and watched a few videos. Great work on developing this package.

Some context before we get to the questions. Below is an example.

My dataset has 10K records, of which 9900 patients have Lung Cancer. The remaining 100 patients don’t have Lung Cancer.

Now I have a rule-based phenotype algorithm to identify patients with Lung Cancer.

So we implemented this algorithm in ATLAS (as a cohort definition) to identify Lung Cancer patients.

This algorithm (as implemented in ATLAS) returns 9500 patients (the cohort generation result).

As mentioned in the paper and videos, it’s easy to assess accuracy, which is (9500/10000) = 95%.

But the class proportions are imbalanced in our dataset, so accuracy may not be a reliable measure of the algorithm’s performance.

So we are interested in the overall performance of the phenotype algorithm, including sensitivity, specificity, etc.

As you can see, my dataset is a mix of Lung Cancer and non-Lung Cancer patients, and the class proportions are imbalanced. Hence the questions below.

So my objective is to assess the performance of the Lung Cancer algorithm and find its characteristics, such as sensitivity and specificity.
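To make the concern about accuracy concrete, here is a minimal Python sketch of the usual 2x2 metrics on the 10K dataset described above. The cohort size of 9500 alone does not say how many of those patients are true cases, so the split in `hypothetical` is an assumed illustration, not a result from the data. `ConfusionCounts` and `performance` are names made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # cases the algorithm flags
    fp: int  # non-cases the algorithm flags
    fn: int  # cases the algorithm misses
    tn: int  # non-cases the algorithm leaves out


def performance(c: ConfusionCounts) -> dict:
    """Standard performance characteristics from a 2x2 table."""
    total = c.tp + c.fp + c.fn + c.tn
    return {
        "accuracy": (c.tp + c.tn) / total,
        "sensitivity": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "ppv": c.tp / (c.tp + c.fp),
    }


# Degenerate algorithm that flags all 10,000 patients as lung cancer.
flag_everyone = ConfusionCounts(tp=9900, fp=100, fn=0, tn=0)

# Assumed split of the 9,500 flagged patients (hypothetical, not from the post).
hypothetical = ConfusionCounts(tp=9480, fp=20, fn=420, tn=80)

for name, counts in [("flag everyone", flag_everyone), ("hypothetical", hypothetical)]:
    print(name, {k: round(v, 4) for k, v in performance(counts).items()})
```

The degenerate "flag everyone" algorithm scores very high on accuracy while its specificity is zero, which is why accuracy alone says little on a dataset this imbalanced.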

However, I have a few questions, as I am just getting started and learning through this forum.

1. xSpec cohort - 9500 patients

I understand that we create cohorts in ATLAS with criteria that help us capture the positive cases (e.g., if I require 10 codes for lung cancer, we can be sure that the person has lung cancer). But wouldn’t this filter out all other possible lung cancer cases? For example, someone might have been diagnosed with lung cancer in hospital recently and have only 1 or 2 codes. As you can see, my xSpec cohort identifies only 9500 patients with lung cancer instead of 9900, because I chose to include only people with 10 lung cancer codes. In this case, aren’t we failing to identify this person as a case? Will this drop in records have any impact? I understand there may not be one right answer, but I would like to know how you would do this.

2. xSens cohort - 90 patients

Let’s say I create a cohort to “exclude Lung Cancer concepts”. Can this be a valid definition for the sensitive cohort? It would give us a list of patients who have a high probability of not being a lung cancer case, and I get only 90 of the 100 actual patients in the xSens cohort. Again, a drop of 10 records. Is it necessary to try to get as many records as possible under each cohort definition? What is the impact of the dropped records? Is there a most appropriate way to do this?

3. Prevalence cohort

I see in the doc that xSens can be used as the prevalence cohort, but I am confused here. For example, as I have 10K people in my population and 9900 of them have lung cancer, to know the prevalence of the disease I need to consider the entire population (9900/10000 equals 99%). Am I right? So I am trying to understand why the default value for this field is the xSens cohort, because it may not give the prevalence of the disease. How do I create this cohort? Should I create a cohort to identify people with lung cancer from a population of something like xSpec + xSens? As I have both criteria (xSpec and xSens), can this give an accurate prevalence value? Is there a most appropriate way to do this? Could you help me with this, please?

4. PLP model

I see that after the creation of the above 3 cohorts, the package creates the diagnostic model. But in the documentation for “createPhenotypeModel”, I don’t see any settings/parameters to tune the train/test ratio. As you can see, my data is heavily imbalanced. How do we handle such scenarios during model creation? I know PLP has settings to define the train and test ratios, but PheValuator doesn’t expose them.

5. Evaluation cohort

I see in the doc that we get data for this cohort from xSpec, but shouldn’t this be unseen data for the model? We have already created 3 cohorts above and, based on my reading of this package, it creates the evaluation cohort from the xSpec cohort. Could you help me understand why we use xSpec here? How can we find unseen data to evaluate on, given that we have already used our data to create the above 3 cohorts? I would like to understand this package better so that I can use it correctly and interpret my results accordingly.

6. Creating PA for evaluation

I understand that this is the step where we implement our phenotype algorithm in ATLAS and use its cohort definition id in “testPhenotypeAlgorithm” for assessment. But may I know what “EV” under cutpoints is?

I understand we usually use 0.5 as a threshold to discriminate between the classes, but what does “EV” mean and how does it differ from other threshold values like 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, etc.?

Could you help me with these questions, please?

jswerdel · February 28, 2020, 11:48am

Hi Akshay,
PheValuator was designed to test phenotype algorithms (cohort definitions) to be used to create cohorts. For example, if you wanted to create a cohort for lung cancer in an observational data set, you might create a definition that looks for 1 lung cancer code in a subject’s record. This would generate a cohort with, perhaps, 1% of your population (assuming the prevalence of lung cancer is about 1% in your population). PheValuator would be used to calculate the performance characteristics, e.g., sensitivity, of that cohort definition (1 lung cancer code).
If you tell me the goal of your cohort definition in your population of 99% lung cancer patients (that is, which cohort or sub-population you wish to create), I can provide better answers to your questions.

Thanks.

Akshay · February 28, 2020, 11:59am

Hi @jswerdel,

I have updated the original post with some more context. Does that help?

Akshay · March 1, 2020, 5:51am

Could you help with this?

jswerdel · March 2, 2020, 12:57pm

Hi Akshay,
After thinking about it some more, I don’t think you can use PheValuator for the 10K-subject dataset you described, partly because of the issue you mentioned (class imbalance) but also because there are not enough negatives (no lung cancer) to build the model accurately. That being said, it would be possible to run PheValuator on the original data from which those 10K subjects were extracted, if that is available to you. In that balanced dataset (i.e., one that approximates the prevalence of lung cancer in the population), you would create the xSpec and xSens cohorts and run through the 3 PheValuator steps (createPhenotypeModel, createEvaluationCohort, and testPhenotypeAlgorithm). The 3rd step will provide you with the performance characteristics (e.g., sensitivity, PPV) of the phenotype algorithm you are using.

Hope this helps - let me know if you have additional questions.

Akshay · March 2, 2020, 1:15pm

@jswerdel,

Thanks for the response. Unfortunately, the population/dataset that we have right now is heavily imbalanced.

So am I right to understand that PheValuator can only be used when the dataset is balanced?
Would you mind helping me with questions 3, 4, and 5 from above?

jswerdel · March 2, 2020, 1:43pm

Hi Akshay,

It is imbalanced, but the larger issue is that there are so few negatives (100) to build the model. If you were able to have, say, 10000 negatives as well as 9900 positives, that could work regardless of the imbalance. I will use a dataset with 10000 negatives and 9900 positives to answer 3, 4, and 5 (as this is an example where PheValuator will work).
Answers to questions 3, 4, and 5
Q3 - the prevalence cohort could be the same as the xSens cohort. For example, a cohort definition where you look for >= 1 condition code for lung cancer. This should approximate the prevalence (likely about 9900/19900) and will allow PheValuator to create a model with the correct balance.

Q4 - the model is automatically created with a train set of about 75% of the data and a test set of 25% of the data. In the hypothetical 19900 subject model described, this will work fine.

Q5 - the xSpec cohort is included in the createEvaluationCohort step simply due to a limitation in the PLP package that requires some defined outcomes. In the final evaluation, the subjects in the xSpec cohort that were included are removed, keeping these outcomes “unseen”.
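
As a rough illustration of the 75%/25% split mentioned in Q4, the Python sketch below splits the hypothetical 19,900-subject dataset so that both partitions keep about the same proportion of positives. Whether PheValuator or PLP stratifies the split by outcome in this way is an assumption made here, and `stratified_split` is a made-up helper, not a function from either package.

```python
import random


def stratified_split(labels, test_fraction=0.25, seed=42):
    """Split subject indices into train/test, preserving class proportions."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test


# Hypothetical dataset from the answer above: 9,900 positives, 10,000 negatives.
labels = [1] * 9900 + [0] * 10000
train, test = stratified_split(labels)

for name, part in [("train", train), ("test", test)]:
    positives = sum(labels[i] for i in part)
    print(f"{name}: {len(part)} subjects, {positives / len(part):.3f} positive")
```

With only 100 negatives, as in the original 10K dataset, the same split would leave about 25 negatives in the test set, which gives a sense of why so few negatives are a problem.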

Akshay · March 2, 2020, 2:08pm

Hi @jswerdel

In this case, should we only be using the “xSpec” cohort for prevalence? You are considering the positives (9900) from your hypothetical example. So why is it “xSens” in the doc?

If you have time, could you kindly help me understand this better? Which outcomes does the PLP package require?

jswerdel · March 2, 2020, 2:28pm

xSpec would be too specific - the sensitivity of an xSpec definition is usually about 10% or less - meaning that the prevalence will be greatly underestimated. While xSens will have a lower PPV (lower specificity), we have found it gives a better estimate of the prevalence: not perfect, but close enough for developing the model.
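
The size of the underestimate can be worked out with a little arithmetic. If a cohort definition has a given sensitivity and PPV, the fraction of the population it captures is prevalence × sensitivity / PPV. The Python sketch below uses the roughly 10% xSpec sensitivity quoted above; the PPV values and the xSens sensitivity are assumed purely for illustration, and `cohort_fraction` is a made-up helper name.

```python
def cohort_fraction(true_prevalence: float, sensitivity: float, ppv: float) -> float:
    """Fraction of the population captured by a cohort definition.

    True positives make up sensitivity * prevalence of the population,
    and the cohort size is the true positives divided by the PPV.
    """
    return true_prevalence * sensitivity / ppv


true_prevalence = 0.01  # the 1% lung cancer example used earlier in the thread

# xSpec: sensitivity of about 10% (from the post); a PPV of 0.99 is assumed.
xspec_estimate = cohort_fraction(true_prevalence, sensitivity=0.10, ppv=0.99)

# xSens: sensitivity 0.95 and PPV 0.80 are assumed values for illustration.
xsens_estimate = cohort_fraction(true_prevalence, sensitivity=0.95, ppv=0.80)

print(f"xSpec-based prevalence estimate: {xspec_estimate:.4f}")
print(f"xSens-based prevalence estimate: {xsens_estimate:.4f}")
```

Under these assumed numbers, the xSpec cohort captures roughly a tenth of the true prevalence, while the xSens cohort lands much closer to it.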

If you use the applyModel function in PLP, you will get an error if there are no subjects with outcomes, that is, if none of the subjects in the T (target) population match subjects in the O (outcome) population. The error occurs when PLP tries to calculate the performance (e.g., AUC).

Akshay · March 3, 2020, 3:29am

Hi @jswerdel,

Thanks for the response. May I know why we have to exclude certain concepts?

We would have built the xSpec and xSens cohorts with a lot of concepts, so if we have to exclude them, do we have to key them in one by one manually?

Why do we have to exclude the concepts that were used to build the xSens and xSpec cohorts?
If I have used more than 100 concepts (parent concepts only) for xSpec and 100 concepts (parent concepts only) for xSens, do we have to manually key in all 200 concepts in this field?

Could you help us with this?

jswerdel · March 3, 2020, 10:57am

Hi Akshay,

The concepts used to develop the cohorts need to be excluded, or else there will be complete separation between the noisy positives and negatives. This will cause the model creation process to either fail or produce extremely large coefficients for those concepts.
You would key in each of the concepts in the excludedConcepts parameter. You can copy the full list of concepts easily in ATLAS by clicking on the concept set and going to the Export tab where, toward the bottom of the screen, you will find a copy button to click.

Generally though, the concepts used in the xSpec cohort should be the same as the ones in the xSens cohort, so you should only have, in this case, 100 concepts to paste in.

Akshay · March 3, 2020, 11:29am

Hi @jswerdel,

I am sorry, and I might be wrong too, but isn’t our objective usually to separate the two classes, lung cancer or not? In the doc I see that noisy positives are the subjects that have a high probability of being a positive case (lung cancer), whereas noisy negatives have a high probability of not being a case (not lung cancer). So the subjects in these cohorts are different. May I kindly request you to help me understand what you mean by “model creation process to fail or have extremely large coefficients for those concepts”?
The concepts used for the two cohorts could be different as well. Am I right? For example, let’s say I have people with Lung Cancer and Throat Cancer. When you say they will be the same, do you mean something like the below?

Select all people who have at least 1 lung cancer code - Positives

Select all people who have exactly 0 lung cancer codes - Negatives (let’s say the negatives have throat cancer).

In this case, the concepts used for both positives and negatives would remain the same. Am I right? But does it always have to follow this convention? If written differently, wouldn’t the concepts be different?

I am trying to learn to use PheValuator the right way.

jswerdel · March 3, 2020, 12:01pm

Hi Akshay,

The model will fail if you have perfect separation between the classes on some features. So if everyone in the positive class has a lung cancer code (because you used that code to build the xSpec cohort) and no one in the negative class has a lung cancer code (because you used the code to build the xSens cohort and no one in the xSens cohort is allowed to be included in the negatives), you will have perfect separation, which causes problems with model creation. You can test this by running the process with and without the exclusion codes and looking at the differences in the models (though one model may fail).
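
The effect of perfect separation can be reproduced in a toy setting. The Python sketch below fits a plain, unregularized logistic regression with one binary feature ("has a lung cancer code") by gradient ascent. It is a generic illustration under assumed toy data, not the model-fitting procedure PheValuator or PLP actually uses, and `fit_logistic` is a made-up helper.

```python
import math


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, steps, lr=0.5):
    """Unregularized one-feature logistic regression via gradient ascent.

    Returns (intercept, coefficient) after the given number of steps.
    """
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


ys = [1] * 50 + [0] * 50

# Perfect separation: every positive has the code, no negative has it.
separated_xs = [1] * 50 + [0] * 50

# Overlap: 40 of 50 positives and 10 of 50 negatives have the code.
overlap_xs = [1] * 40 + [0] * 10 + [1] * 10 + [0] * 40

for steps in (200, 2000):
    _, sep_coef = fit_logistic(separated_xs, ys, steps)
    _, ovl_coef = fit_logistic(overlap_xs, ys, steps)
    print(f"steps={steps}: separated coef={sep_coef:.2f}, overlap coef={ovl_coef:.2f}")
```

With overlapping data the coefficient settles at a finite value, while under perfect separation it keeps growing the longer the fit runs, because no finite coefficient maximizes the likelihood.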
For clarity, the xSens cohort is not 0 lung cancer codes; it would be, say, one lung cancer code. The tool will exclude the subjects included in this definition. The idea is not to build the noisy negatives with the xSens cohort but rather to tell the tool which subjects are NOT to be included in the noisy negatives.
But yes, there could be differences in the concepts used for xSpec and xSens but I haven’t found the need to do that in my research.
Attached are what I have used for my xSpec and xSens for Lung cancer:
xSpecLungCancer.docx (6.6 KB)
xSensLungCancer.docx (4.8 KB)

(Note: these should be json files but this forum does not allow files with that extension to be uploaded)

Akshay · March 6, 2020, 8:36am

Hi @jswerdel,

As this post is related to PheValuator, I thought of linking it here.

jswerdel · March 6, 2020, 12:47pm

How did you get your positive and negative cases? If they came from patient chart records that were verified by clinicians, that would likely make this a very good dataset for testing your phenotype algorithm without the use of PheValuator. PheValuator uses ML to develop a probabilistic set of cases and non-cases. This is needed because in nearly all datasets (including all the datasets we use) we are unable to do large-scale chart review by clinicians to determine the “truth” of each diagnosis. We need an alternative method to determine the cases and non-cases. Once we have a probabilistic determination of the outcome for a large number of subjects, we can assess the performance of our phenotype algorithms. Your dataset, for instance, will likely produce a valid assessment of sensitivity, as you have a large number of cases. My only concern with your method is the small number of non-cases. Having only 100 non-cases would make it difficult to accurately determine PPV for your algorithm.
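
One way a probabilistic set of cases and non-cases can be scored against a phenotype algorithm is to treat each subject’s predicted probability as a fractional case. The Python sketch below illustrates that idea only; the probabilities are made up, `expected_performance` is a hypothetical helper, and this is not taken from PheValuator’s source code.

```python
def expected_performance(probabilities, flagged):
    """Score a phenotype algorithm against probabilistic case labels.

    probabilities: model-predicted probability that each subject is a case.
    flagged: whether the phenotype algorithm includes each subject.
    Each subject contributes p to the "case" cells and 1 - p to the
    "non-case" cells of the 2x2 table.
    """
    tp = fp = fn = tn = 0.0
    for p, in_cohort in zip(probabilities, flagged):
        if in_cohort:
            tp += p
            fp += 1.0 - p
        else:
            fn += p
            tn += 1.0 - p
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
    }


# Made-up probabilities and cohort membership for four subjects.
probabilities = [0.9, 0.8, 0.2, 0.1]
flagged = [True, True, False, True]

print(expected_performance(probabilities, flagged))
```

The non-case cells drive specificity and PPV, which is consistent with the concern above that 100 non-cases are too few for a reliable PPV.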