Querying LAERTES and data in OHDSI reference set

rkboyce · December 19, 2015, 1:08pm

Lili,

I think you are reporting low precision (P = TP/(TP + FP)) values rather than low recall (R = TP/(TP + FN)). Would you please correct that in your tables, and also add a column for the actual recall as well as the balanced F-measure (2PR/(P+R)), which nicely balances P and R?
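In case it helps when you rebuild the tables, here is a minimal sketch of the three measures in Python. The function name and the counts are made up for illustration; they are not taken from your results.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, balanced F-measure) from raw counts."""
    # Guard against empty denominators so the function never divides by zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical counts, for illustration only.
p, r, f = precision_recall_f(tp=40, fp=160, fn=10)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")  # P=0.20 R=0.80 F=0.32
```

Note how a high recall does not rescue the F-measure when precision is low, which is why it is worth reporting all three columns.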

As for the low precision you are finding, it does look concerning (after all, there are only 13K ingredients!). However, I walked through the same process for DILI Definition 1 and think I might have identified a clue. First, are you counting the ‘drug_concept_id’ column or the ‘ingredient_id’ column when you count unique drugs? You should be counting the latter. Second, Erica’s query for the HOIs expands the HOI lists using ancestor/descendant relationships. But the SNOMED concept_ids are already expanded for the OMOP definitions. So, the following is probably a better approach because it is more direct and shouldn’t introduce spurious HOIs (btw, I am not seeing the same issue with modality that you are):

1. create the same TEMP_DRUG_HOI_EVIDENCE_W_INGREDIENTS table, and
2. run a simple query (you get this kind of formatting by putting backticks before and after the query):

SELECT COUNT(DISTINCT ingredient_id) FROM TEMP_DRUG_HOI_EVIDENCE_W_INGREDIENTS WHERE CONDITION_CONCEPT_ID IN (448372,206254,205828,449466,206944,...<ALL CONCEPT IDS FROM THE CONCEPT ID COLUMN OF THE DILI DEFINITION> ) AND MODALITY = 't'

You would have to recalculate P, R, and F, but the count looks a bit better (~3K ingredients, though probably still not very good for precision). Once we are sure that we are counting the right things, we will need to trace some of the false positives back to how they were brought in from the sources.
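On the counting question, here is a toy illustration of why the column choice matters. It is only a sketch: it uses an in-memory SQLite table with an assumed four-column layout and made-up rows (the drug and ingredient IDs are hypothetical; only the two condition concept IDs come from the query above), so it is not the real LAERTES schema or data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed, simplified layout of the temp table (the real one has more columns).
conn.execute("""
    CREATE TABLE TEMP_DRUG_HOI_EVIDENCE_W_INGREDIENTS (
        drug_concept_id INTEGER,
        ingredient_id INTEGER,
        condition_concept_id INTEGER,
        modality TEXT
    )
""")
# Made-up rows: two drug products (1001, 1002) share ingredient 1.
rows = [
    (1001, 1, 448372, "t"),
    (1002, 1, 448372, "t"),
    (1003, 2, 206254, "t"),
    (1004, 3, 206254, "f"),  # excluded: modality is not 't'
    (1005, 4, 999999, "t"),  # excluded: condition not in the HOI list
]
conn.executemany(
    "INSERT INTO TEMP_DRUG_HOI_EVIDENCE_W_INGREDIENTS VALUES (?, ?, ?, ?)", rows
)

hoi_ids = [448372, 206254]  # stand-in for the full DILI concept ID list
placeholders = ",".join("?" for _ in hoi_ids)


def count_distinct(column):
    # The column name is interpolated, so only pass fixed, trusted names.
    sql = (
        f"SELECT COUNT(DISTINCT {column}) "
        "FROM TEMP_DRUG_HOI_EVIDENCE_W_INGREDIENTS "
        f"WHERE condition_concept_id IN ({placeholders}) AND modality = 't'"
    )
    return conn.execute(sql, hoi_ids).fetchone()[0]


n_drugs = count_distinct("drug_concept_id")
n_ingredients = count_distinct("ingredient_id")
print(n_drugs, n_ingredients)  # 3 2
```

Counting ‘drug_concept_id’ counts every product separately, so the number of "unique drugs" is inflated relative to counting ‘ingredient_id’.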

I expect that the main issue is still, in large part, the drug class mention problem that we have already discussed. A table we can use to filter out the drug-HOI associations that occur by class mapping is at this link. Because of the source, the table has HOIs as OMOP concept IDs for MeSH, so it will require an additional step to get to SNOMED. I can have my part-time analyst work on that. It will reduce the drug-HOIs from Medline to only those with exact drug mentions (rather than class mentions mapped to individual drugs) and will likely show much better precision but lower recall.

The next version of LAERTES (set for 1/30) will try a new approach that I think will greatly improve precision without hurting recall.