Let’s say I have a rule-based phenotype algorithm to identify Lung Cancer patients. Here is the usual example:
I have a dataset of 10K patients, with 9900 in the positive (Lung Cancer) class and 100 in the negative (non-Lung Cancer) class. In other words, I already know the labels.
Now I run the rule-based phenotype algorithm on my dataset and it returns 9600 records. That is, the algorithm identified 9600 out of the 10K as Lung Cancer patients.
From the PheValuator documentation, I infer that it is used to better assess the performance of the algorithm by estimating characteristics such as sensitivity, specificity, PPV, NPV, etc.
But why do I need to build an ML model to get a probabilistic score for being assigned to the positive or negative class when I already know the labels (i.e., whether each subject belongs to the positive or negative class)?
Ex: Among my 10K patients, I already know that 9900 are in the Lung Cancer (positive) class, and I know their subject_ids.
Similarly, for the negative class, I know the subject_ids of those 100 subjects.
Now, if I compare the ALGORITHM OUTPUT (9600 subject_ids) against the original subject_ids of the positive and negative classes, won’t I get the characteristics (sensitivity, specificity, PPV) of the phenotype algorithm directly?
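To make the manual comparison concrete, here is a minimal sketch of what I have in mind. All subject_ids and counts below are made-up toy values (not my real data), and the function name `evaluate_phenotype` is just something I invented for illustration:

```python
def evaluate_phenotype(algorithm_positive: set, true_positive: set, all_subjects: set) -> dict:
    """Compare algorithm output against already-known gold-standard labels."""
    true_negative = all_subjects - true_positive

    tp = len(algorithm_positive & true_positive)   # correctly flagged cases
    fp = len(algorithm_positive & true_negative)   # non-cases flagged as cases
    fn = len(true_positive - algorithm_positive)   # missed cases
    tn = len(true_negative - algorithm_positive)   # correctly left out

    return {
        "sensitivity": tp / (tp + fn),  # of real cases, fraction found
        "specificity": tn / (tn + fp),  # of non-cases, fraction excluded
        "PPV": tp / (tp + fp),          # of flagged, fraction truly cases
        "NPV": tn / (tn + fn),          # of unflagged, fraction truly non-cases
    }

# Toy data: 10 subjects, 9 true cases, algorithm flags subjects 1-8.
all_ids = set(range(1, 11))
true_cases = set(range(1, 10))   # subjects 1-9 are Lung Cancer
algo_output = set(range(1, 9))   # algorithm identifies subjects 1-8

metrics = evaluate_phenotype(algo_output, true_cases, all_ids)
print(metrics)
```

This is exactly the "easy but manual" approach I mean: pure set arithmetic over subject_ids, no ML model involved.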
- I understand that PheValuator is meant to be used across sites. Let’s say 4 sites participate in a phenotype validation study, and each of them can obtain the algorithm’s characteristics using the easy but manual approach above. What’s the use of an ML model here? Do we transport the model from one site to another? If we want to validate the algorithm’s rules across different data sources, doesn’t transporting the rule-based phenotype definition (the cohort definition JSON) suffice? How does this work?
I am sure I am failing to see the bigger picture, but what is the extra thing that PheValuator does that I’m missing due to my limited exposure?
Can someone help me understand this, please?