How to interpret the confusion matrix of an unbalanced classification problem? The false positives far outnumber the true positives

William · August 10, 2020, 5:27pm

Dear all,

I work with several highly unbalanced classification datasets using Patient-Level-Prediction (PLP) in Atlas.
For example, positive vs. negative is 500 vs. 210k. I moved the decision threshold to keep both the false positive rate and the false negative rate low; they came out at about 0.3 and 0.2, respectively.

The confusion matrix looks like this (rows are the actual classes, columns are the predicted classes):

                   Predicted Positive   Predicted Negative
Actual Positive    310                  137
Actual Negative    50k                  160k

As you can see, although the false positive and false negative rates are relatively low, we still misclassify 50k negatives as positive, compared with only 310 true positives, which makes the classifier practically useless. I tried SMOTE to rebalance the dataset and also tried cost-sensitive learning to penalize false positives more heavily. Neither helped much.
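
To make the numbers concrete, here is a minimal Python sketch that computes the usual rates from the matrix above. The counts are the rounded ones from the table (50k and 160k are approximations), and the function names `rates` and `ppv_from_rates` are made up for illustration; they are not part of PLP or Atlas. Computed from the table as posted, the false positive rate is about 0.24 and the false negative rate about 0.31.

```python
# Counts copied from the confusion matrix in this post.
# 50k and 160k are rounded in the post, so results are approximate.
tp, fn = 310, 137          # actual positives: predicted positive / negative
fp, tn = 50_000, 160_000   # actual negatives: predicted positive / negative


def rates(tp, fn, fp, tn):
    """Return standard confusion-matrix metrics as a dict."""
    total = tp + fn + fp + tn
    return {
        "sensitivity": tp / (tp + fn),   # true positive rate (recall)
        "fnr": fn / (tp + fn),           # false negative rate
        "fpr": fp / (fp + tn),           # false positive rate
        "ppv": tp / (tp + fp),           # positive predictive value (precision)
        "prevalence": (tp + fn) / total, # share of actual positives
    }


def ppv_from_rates(sensitivity, fpr, prevalence):
    """PPV via Bayes' rule, from sensitivity, FPR and prevalence only."""
    true_pos_mass = sensitivity * prevalence
    false_pos_mass = fpr * (1 - prevalence)
    return true_pos_mass / (true_pos_mass + false_pos_mass)


m = rates(tp, fn, fp, tn)
for name, value in m.items():
    print(f"{name}: {value:.4f}")

# Same PPV, reconstructed from the rates alone.
print(f"ppv (Bayes): {ppv_from_rates(m['sensitivity'], m['fpr'], m['prevalence']):.4f}")
# How much better than picking positives at random.
print(f"lift over prevalence: {m['ppv'] / m['prevalence']:.2f}")
```

With these counts the PPV is roughly 0.6% against a prevalence of roughly 0.2%, so the flagged group is about three times richer in positives than the whole population, even though false positives swamp true positives in absolute numbers.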

Any insights, or any information that could help solve this puzzle? Does this mean PLP fails on this type of data?
Thanks!
William

William · September 23, 2020, 6:53pm

A large number of false positives relative to true positives (i.e. a low positive predictive value (PPV)) shows up in all my PLPs, not just in extremely unbalanced datasets.
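
One way to see how PPV depends on class balance is to ask what false positive rate would be needed to reach a given PPV. The sketch below rearranges Bayes' rule for that; the helper names `ppv` and `max_fpr_for_ppv`, the sensitivity of 0.69 and the prevalence values are assumptions chosen for illustration, not outputs of PLP.

```python
def ppv(sensitivity, fpr, prevalence):
    """Positive predictive value from sensitivity, FPR and prevalence."""
    tp_mass = sensitivity * prevalence
    fp_mass = fpr * (1 - prevalence)
    return tp_mass / (tp_mass + fp_mass)


def max_fpr_for_ppv(sensitivity, prevalence, target_ppv):
    """Largest FPR that still reaches target_ppv at this sensitivity.

    Obtained by solving ppv(sensitivity, fpr, prevalence) = target_ppv for fpr.
    """
    return (sensitivity * prevalence * (1 - target_ppv)) / (
        target_ppv * (1 - prevalence)
    )


# Illustrative values only: sensitivity 0.69, target PPV of 50%.
for prevalence in (0.002, 0.05, 0.2):
    needed = max_fpr_for_ppv(0.69, prevalence, 0.5)
    print(f"prevalence {prevalence:.3f}: FPR must be <= {needed:.4f}")
```

Under these assumptions, a prevalence of 0.2% requires a false positive rate well below 1% to reach a PPV of 50%, while at higher prevalence the same sensitivity tolerates a much larger false positive rate.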