OHDSI Home | Forums | Wiki | Github

Querying LAERTES and data in OHDSI reference set

The LU_REF_SET_HOI_DEF is to define the OMOP and EUADR reference sets (this was used for research purposes with LAERTES). I’m guessing you aren’t specifically trying to use that.

The reason you only get HOI_CONCEPT_IDs back with '01 is because of your second filter is restricting yourself to that.
and hoi_concept_id = 500000801

Could you take a step back and explain the question you are trying to answer? I may have missed this from the original thread.

Hi Erica,

Apologies for the delay in response and if my question wasn’t clear. Allow
me to clarify.

I’m trying to validate results in the drug-HOI relationships from “OMOP
2011-2012 Experimental Methods results”. This reference set includes
drug-HOI relationships on the aforementioned list of cohort-based HOI IDs.

http://omop.org/researcharchive
http://75.101.131.161/download/loadfile.php?docname=OMOP%202011-2012%20Experiment%20Method%20Results

After perusing the LAERTES database, I noticed that the numbers (i…e
500000801, 500000802, etc.) do not correspond to conditions, but cohorts.
therefore, I can’t search the LAERTES database for drugs associated with
condition (which has corresponding condition_concept_id) if these are
cohort-based HOI IDs. In effort to find the condition concept ID
corresponding to each cohort HOI ID, I had found the LU_REF_SET_HOI_DEF,
and noticed that it contains information on both the hoi_concept_id and
condition_concept_id.

Accordingly, I had executed the query for each of the HOI IDs, by replacing
the number as follows, for example for 500000802:

select
rxnorm_drug, drug, hoi_concept_id, hoi_concept_name, hoi, snomed_hoi,
condition_concept_name, evidence_type, modality
from public.drug_hoi_relationship relationship
join public.lu_ref_set_hoi_def ref_set on ref_set.condition_concept_id =
relationship.hoi
join public.drug_hoi_evidence evidence on evidence.drug_hoi_relationship =
relationship.id
where hoi in
(select condition_concept_id from public.lu_ref_set_hoi_def
where hoi_concept_id in (500000802))
and hoi_concept_id = 500000802
and modality = ‘t’
order by drug;

However, as I’d said in my earlier thread, I could only find information on
cohort HOI ID ending in *01.

I would appreciate if you could assist me in solving this problem.

Thank you,
Lili

@lilipeng,

Thank you for the additional information.

The CONDITION_CONCEPT_ID in this document represent a suite of CONCEPT_IDs found in the OMOP Vocabulary (which based on your comment above I think you figured out). For example 500000301-“OMOP Acute Liver Failure 1” is an OMOP generated concept with 902 SNOMED concepts beneath it (not including itself). You can find these concepts by asking the Vocabulary:

CONCEPTS RELATED TO HOI 50000030

SELECT DISTINCT c1.CONCEPT_ID, c1.CONCEPT_NAME
FROM CONCEPT c
	JOIN CONCEPT_ANCESTOR ca
		ON ca.ANCESTOR_CONCEPT_ID = c.CONCEPT_ID
	JOIN CONCEPT c1
		ON c1.CONCEPT_ID = ca.DESCENDANT_CONCEPT_ID
		AND c1.INVALID_REASON IS NULL
WHERE c.CONCEPT_ID = 500000301 /*OMOP Acute Liver Failure 1*/

I have to apologize, the table you were using LU_REF_SET_HOI_DEF is a version of the OMOP Reference Set that I created more based on the original OMOP experiment. It may be more appropriate to use the OMOP Vocab version I shared above.

I think the main issue here is you are using my lookup and it probably won’t suit your needs here.

Try take a bit different of an approach:

DIFFERENT TAKE ON LILI’S QUERY

WITH CTE_HOI AS (
	SELECT DISTINCT c.CONCEPT_ID AS HOI_CONCEPT_ID, c.CONCEPT_NAME AS HOI_CONCEPT_NAME, 
		c1.CONCEPT_ID AS CONDITION_CONCEPT_ID, c1.CONCEPT_NAME AS CONDITION_CONCEPT_NAME
	FROM CONCEPT c
		JOIN CONCEPT_ANCESTOR ca
			ON ca.ANCESTOR_CONCEPT_ID = c.CONCEPT_ID
		JOIN CONCEPT c1
			ON c1.CONCEPT_ID = ca.DESCENDANT_CONCEPT_ID
			AND c1.INVALID_REASON IS NULL
	WHERE c.CONCEPT_ID IN (
		/*YOU NEED TO DOUBLE CHECK THAT THESE ARE ALL THE OMOP HOI CONCEPT IDS*/
		500000402,500000404,500000801,500001003,500000308,500000303,500000802,500000805,500000301,500000403,500000405,500000401,500000302,500000305,500001001,500000307,500000803,500000306,500000304
	)
)
SELECT z.DRUG_HOI_RELATIONSHIP, h.HOI_CONCEPT_ID, h.HOI_CONCEPT_NAME, h.CONDITION_CONCEPT_ID, h.CONDITION_CONCEPT_NAME, z.DRUG AS DRUG_CONCEPT_ID, z.RXNORM_DRUG AS DRUG_CONCEPT_NAME, 
		z.EVIDENCE_TYPE, z.MODALITY, z.STATISTIC_TYPE, z.STATISTIC_VALUE, z.EVIDENCE_LINKOUT 
FROM CTE_HOI h
	JOIN (
		SELECT *
		FROM DRUG_HOI_EVIDENCE e
			JOIN DRUG_HOI_RELATIONSHIP r
				ON r.ID = e.DRUG_HOI_RELATIONSHIP
		WHERE MODALITY = 't'
	) z
		ON z.HOI = h.CONDITION_CONCEPT_ID
ORDER BY h.HOI_CONCEPT_NAME, h.CONDITION_CONCEPT_NAME, z.RXNORM_DRUG, z.EVIDENCE_TYPE

There are 19 HOIs that I could see in your file you sent me, but the file truncated in Excel so I could be missing others, please double check this. If I run this query, only 10 HOIs show up - but you’ll have to investigate why. It might be how they are defined in the Vocabulary. Or the concepts used in the definition don’t directly align with what is in LAERTES - this could be the LAERTES evidence was coming in at different levels of SNOMED and it doesn’t directly align with the HOI maps.

Working with conditions can be tricky business. :slight_smile:

Hi Erica,

I finally had a chance to go over your query. First, I would like to thank you for the thorough, informative response, and that you had written the query specifically catered to my question.

I’m somewhat new to querying the database using the OMOP Vocabulary and it didn’t naturally occur to me to have to use the CONCEPT_ANCESTOR table to fetch the condition_concept_IDs that are descendants of the HOI concept IDs (ancestors).

Can you explain why you set the following statement:
AND c1.INVALID_REASON IS NULL
Why are there columns in which the invalid_reason is populated to not null?

Thanks again!

Lili

@lilipeng,

The OMOP Vocabulary is made up of relationships however just using the CONCEPT_RELATIONSHIP table can be tricky, you don’t always know how many hops you’d have to make to get all children or all parents. OMOP created the CONCEPT_ANCESTOR table to solve that.

This presentation is a bit dated, but it may be helpful understanding the structure of the OMOP Vocab:
http://omop.org/sites/default/files/OMOP%20Symposium%20Day%201%20Workshop%205Nov2013%20Reich%20vocabulary.pdf

For why I used INVALID_REASON IS NULL, OMOP CONCEPT_IDs can become invalid over time and when they do the INVALID_REASON is populated with a D (deprecated) or U (I think stands for updated). To keep from ever picking up an invalid CONCEPT_ID I used that filter. There may be a situation where you need an INVALID concept, but most likely not for what you are doing here.

No worries and hope this is helpful!

Hi Erica,

"For why I used INVALID_REASON IS NULL, OMOP CONCEPT_IDs can become invalid
over time and when they do the INVALID_REASON is populated with a D
(deprecated) or U (I think stands for updated). ]

What are reasons a concept_ID can become deprecated over time?

To keep from ever picking up an invalid CONCEPT_ID I used that filter.
There may be a situation where you need an INVALID concept, but most likely
not for what you are doing here."’

Out of curiosity, what are such situations? A possible scenario:
looking up data on a condition-based cohort in which the corresponding
condition_concept_ID has been updated (U) from a deprecated (D) one?

I’d executed the query that you’d written for me, but I was only able to
find condition_concept_IDs for 10 out of 18 HOI IDs, still (see below). I
wonder if this is simply characteristic of the information (or lack
thereof) in the OMOP vocabulary. Because when execute the following query
for 500000301:

SELECT DISTINCT c.CONCEPT_ID AS HOI_CONCEPT_ID, c.CONCEPT_NAME AS

HOI_CONCEPT_NAME,
c1.CONCEPT_ID AS CONDITION_CONCEPT_ID, c1.CONCEPT_NAME AS
CONDITION_CONCEPT_NAME
FROM CONCEPT c
JOIN CONCEPT_ANCESTOR ca
ON ca.ANCESTOR_CONCEPT_ID = c.CONCEPT_ID
JOIN CONCEPT c1
ON c1.CONCEPT_ID = ca.DESCENDANT_CONCEPT_ID
AND c1.INVALID_REASON IS NULL
WHERE c.CONCEPT_ID IN (500000801)

I get following results (abbreviated):

hoi_concept_id hoi_concept_name condition_concept_id condition_concept_name
500000801 OMOP Acute Myocardial Infarction 1 44782769 Acute myocardial
infarction due to left coronary artery occlusion 500000801 OMOP Acute
Myocardial Infarction 1 4201629 Refractory angina 500000801 OMOP Acute
Myocardial Infarction 1 4119455 New onset angina 500000801 OMOP Acute
Myocardial Infarction 1 45766151 Acute ST segment elevation myocardial
infarction of inferior wall involving right ventricle 500000801 OMOP Acute
Myocardial Infarction 1 4161456 Recurrent angina after coronary stent
placement 500000801 OMOP Acute Myocardial Infarction 1 4051874 Acute
myocardial infarction of posterolateral wall 500000801 OMOP Acute
Myocardial Infarction 1 4145721 Acute non-Q wave infarction 500000801 OMOP
Acute Myocardial Infarction 1 45766075 Acute anterior ST segment elevation
myocardial infarction

However, when I execute the query for the 8 following HOI IDs:

SELECT DISTINCT c.CONCEPT_ID AS HOI_CONCEPT_ID, c.CONCEPT_NAME AS

HOI_CONCEPT_NAME,
c1.CONCEPT_ID AS CONDITION_CONCEPT_ID, c1.CONCEPT_NAME AS
CONDITION_CONCEPT_NAME
FROM CONCEPT c
JOIN CONCEPT_ANCESTOR ca
ON ca.ANCESTOR_CONCEPT_ID = c.CONCEPT_ID
JOIN CONCEPT c1
ON c1.CONCEPT_ID = ca.DESCENDANT_CONCEPT_ID
AND c1.INVALID_REASON IS NULL
WHERE c.CONCEPT_ID IN (500000802, 500000803, 500000805, 500000303,
500000305, 500000306, 500000403, 500000803, 500001003);

I get the the following results:

hoi_concept_id hoi_concept_name condition_concept_id condition_concept_name
500000803 OMOP Acute Myocardial Infarction 3 500000803 OMOP Acute
Myocardial Infarction 3 500000403 OMOP Acute Renal Failure 3 500000403 OMOP
Acute Renal Failure 3 500000305 OMOP Acute Liver Failure 5 500000305 OMOP
Acute Liver Failure 5 500000303 OMOP Acute Liver Failure 3 500000303 OMOP
Acute Liver Failure 3 500000306 OMOP Acute Liver Failure 6 500000306 OMOP
Acute Liver Failure 6 500001003 OMOP Upper GastrointestinaI Ulcer 3
500001003 OMOP Upper GastrointestinaI Ulcer 3 500000802 OMOP Acute
Myocardial Infarction 2 500000802 OMOP Acute Myocardial Infarction 2
500000805 OMOP Acute Myocardial Infarction 5 500000805 OMOP Acute
Myocardial Infarction 5

I find it odd that the hoi_concept_ID and condition_concept_ID values are
identical, just as their respective hoi_concept_name and
condition_concept_name entries are the same. Should I be alarmed about
these results in the OMOP Vocabulary?

Thanks again,
Lili

The unique identifier CONCEPT_ID from OMOP will always exist. However sometimes the vocabularies are no longer supported or a code actually becomes deprecated. For example DRG ‘001’ was ‘Craniotomy Age >17 with Complications, Comorbidities’ untill 2007 and now it is ‘Heart transplant or implant of heart assist system w MCC’. The older version is deprecated but we still need it around because it can be a valid concept. With that same code, ‘001’ used to be ‘Craniotomy Age >17 Except For Trauma’ until 2002 and was updated ‘U’ to ‘Craniotomy Age >17 with Complications, Comorbidities’.

Run this on the OMOP Vocab to see.

SELECT *
FROM CONCEPT
WHERE  CONCEPT_CODE = '001'
AND VOCABULARY_ID = 'DRG'

I wasn’t able to get through the rest of your comment before I had to head out of the office. I’ll be out of office next week for the holiday but I’ll try to pick up when I get back.

Hi Erica, Richard, Sara, and others:

Yesterday I had shown the following results on the amount of data in
LAERTES, for 10 HOI concepts, that could be validated from the OMOP
2011-2012 reference set (aforementioned above). Based on these data,
approximately 2-4% of the data for each HOI could be verified.

HOI concept ID hoi_concept_name LAERTES: Total # unique drug concept IDs
from test set (Evidence of association) OMOP reference: Total # unique drug
concept IDs from reference set # Matches Recall 500000301 OMOP Acute Liver
Failure 1 4843 117 117 0.024158579 500000302 OMOP Acute Liver Failure 2
4228 2 2 0.000473037 500000307 OMOP Acute Liver Failure 7 984 117 82
0.083333333 500000308 OMOP Acute Liver Failure 8 4228 117 117 0.027672658
500000401 OMOP Acute Renal Failure 1 2897 88 84 0.028995513 500000402 OMOP
Acute Renal Failure 2 2897 88 84 0.028995513 500000404 OMOP Acute Renal
Failure 4 2897 88 84 0.028995513 500000405 OMOP Acute Renal Failure 5 2897
88 84 0.028995513 500000801 OMOP Acute Myocardial Infarction 1 2732 102 99
0.036237189 500001001 OMOP Upper Gastrointestinal Ulcer 1 1863 91 81
0.043478261

However, we had discussed whether the # unique drug concept IDs from the
LAERTES included ones not limited to the ingredient level. Accordingly,
I’ve edited the query 'AND c1.CONCEPT_NAME=‘Ingredient’.

WITH CTE_HOI AS (
SELECT DISTINCT c.CONCEPT_ID AS HOI_CONCEPT_ID, c.CONCEPT_NAME AS
HOI_CONCEPT_NAME,
c1.CONCEPT_ID AS CONDITION_CONCEPT_ID, c1.CONCEPT_NAME AS
CONDITION_CONCEPT_NAME
FROM CONCEPT c
JOIN CONCEPT_ANCESTOR ca
ON ca.ANCESTOR_CONCEPT_ID = c.CONCEPT_ID
JOIN CONCEPT c1
ON c1.CONCEPT_ID = ca.DESCENDANT_CONCEPT_ID
AND c1.CONCEPT_NAME=‘Ingredient’
AND c1.INVALID_REASON IS NULL
WHERE c.CONCEPT_ID IN
(
/YOU NEED TO DOUBLE CHECK THAT THESE ARE ALL THE OMOP HOI
CONCEPT IDS
/

500000402,500000404,500000801,500001003,500000308,500000303,500000802,500000805,500000301,500000403,500000405,500000401,500000302,500000305,500001001,500000307,500000803,500000306,500000304
)
)

SELECT z.DRUG_HOI_RELATIONSHIP, h.HOI_CONCEPT_ID, h.HOI_CONCEPT_NAME,
h.CONDITION_CONCEPT_ID, h.CONDITION_CONCEPT_NAME, z.DRUG AS
DRUG_CONCEPT_ID, z.RXNORM_DRUG AS DRUG_CONCEPT_NAME,
z.EVIDENCE_TYPE, z.MODALITY, z.STATISTIC_TYPE,
z.STATISTIC_VALUE, z.EVIDENCE_LINKOUT
FROM CTE_HOI h
JOIN (
SELECT *
FROM DRUG_HOI_EVIDENCE e
JOIN DRUG_HOI_RELATIONSHIP r
ON r.ID = e.DRUG_HOI_RELATIONSHIP
WHERE MODALITY = ‘t’
) z
ON z.HOI = h.CONDITION_CONCEPT_ID

ORDER BY h.HOI_CONCEPT_NAME, h.CONDITION_CONCEPT_NAME, z.RXNORM_DRUG,
z.EVIDENCE_TYPE

However, I did not get any results. Could you help me diagnose?

Actually, I figured out that I was trying to filter out the HOI concept IDs
with the label ‘Ingredient’, which is not possible. After editing it to
what it is now (see below). I was able to obtain the drugs (at the
‘Ingredient’ level) associated with each HOI:

WITH CTE_HOI AS (
SELECT DISTINCT c.CONCEPT_ID AS HOI_CONCEPT_ID, c.CONCEPT_NAME AS
HOI_CONCEPT_NAME,
c1.CONCEPT_ID AS CONDITION_CONCEPT_ID, c1.CONCEPT_NAME AS
CONDITION_CONCEPT_NAME
FROM CONCEPT c
JOIN CONCEPT_ANCESTOR ca
ON ca.ANCESTOR_CONCEPT_ID = c.CONCEPT_ID
JOIN CONCEPT c1
ON c1.CONCEPT_ID = ca.DESCENDANT_CONCEPT_ID
AND c1.INVALID_REASON IS NULL
WHERE c.CONCEPT_ID IN
(
/YOU NEED TO DOUBLE CHECK THAT THESE ARE ALL THE OMOP HOI
CONCEPT IDS
/

500000402,500000404,500000801,500001003,500000308,500000303,500000802,500000805,500000301,500000403,500000405,500000401,500000302,500000305,500001001,500000307,500000803,500000306,500000304
)
)

SELECT z.DRUG_HOI_RELATIONSHIP, h.HOI_CONCEPT_ID, h.HOI_CONCEPT_NAME,
h.CONDITION_CONCEPT_ID, h.CONDITION_CONCEPT_NAME, z.DRUG AS
DRUG_CONCEPT_ID, z.RXNORM_DRUG AS DRUG_CONCEPT_NAME,
z.EVIDENCE_TYPE, z.MODALITY, z.STATISTIC_TYPE,
z.STATISTIC_VALUE, z.EVIDENCE_LINKOUT
FROM CTE_HOI h
JOIN (
SELECT *
FROM DRUG_HOI_EVIDENCE e
JOIN DRUG_HOI_RELATIONSHIP r
ON r.ID = e.DRUG_HOI_RELATIONSHIP
* JOIN concept c on c.concept_id = r.drug*
WHERE MODALITY = ‘t’

  •            and c.concept_class_id ='Ingredient'*
      ) z
              ON z.HOI = h.CONDITION_CONCEPT_ID
    

ORDER BY h.HOI_CONCEPT_NAME, h.CONDITION_CONCEPT_NAME, z.RXNORM_DRUG,
z.EVIDENCE_TYPE

Think about it a bit differently, you have to take the drug that exists and convert it to an ingredient if needed. The way you are filtering now, I think, is you just narrow down to the evidence that comes in as ingredient.

Example, if evidence came in on the clinical drug "Itraconazole 100 MG Oral Capsule [Sporanox]" that needs to be translated, using the Vocabulary, into its ingredient "Itraconazole".

Please double check this but this should help you along you way.

/*FIRST LET'S JUST GET EVERYTHING THAT CAN GO TO THE INGREDIENT LEVEL TO THE INGREDIENT LEVEL*/
/*SINCE IT IS SLOW, WE'LL WRITE IT TO A TEMP TABLE*/
/*REMEMBER A CLINICAL DRUG THAT IS A COMBO DRUG WILL DUPILICATE THE EVIDENCE ACROSS THE COMBINATIONS, WE CANNOT ASSUME WHICH INGREDIENT SHOULD TAKE THE BLAME*/
SELECT r.HOI AS CONDITION_CONCEPT_ID, r.SNOMED_HOI AS CONDITION_CONCEPT_NAME,
	r.DRUG AS DRUG_CONCEPT_ID, r.RXNORM_DRUG AS DRUG_CONCEPT_NAME,  
	c.CONCEPT_ID AS INGREDIENT_ID, c.CONCEPT_NAME AS INGREDIENT_NAME,
	CASE WHEN c.CONCEPT_ID IS NULL THEN 0 ELSE 1 END INGREDIENT_EXISTS, 
	c2.CONCEPT_ID AS CLINICAL_DRUG_ID,c2.CONCEPT_NAME AS CLINICAL_DRUG_NAME,
	CASE WHEN c2.CONCEPT_ID IS NULL THEN 0 ELSE 1 END CLINICAL_DRUG_EXISTS, 
	e.EVIDENCE_TYPE, e.MODALITY, e.EVIDENCE_SOURCE_CODE_ID, e.STATISTIC_VALUE, e.EVIDENCE_LINKOUT, e.STATISTIC_TYPE
INTO TEMP TABLE TEMP_DRUG_HOI_EVIDENCE_W_INGREDIENTS
FROM DRUG_HOI_EVIDENCE e
	JOIN DRUG_HOI_RELATIONSHIP r
		ON r.ID = e.DRUG_HOI_RELATIONSHIP
	/*ROLL DOWN TO INGREDIENTS*/
	LEFT OUTER JOIN CONCEPT_ANCESTOR ca
		ON ca.DESCENDANT_CONCEPT_ID = r.DRUG
		AND ca.ANCESTOR_CONCEPT_ID IN (
			SELECT DISTINCT CONCEPT_ID
			FROM CONCEPT
			WHERE VOCABULARY_ID = 'RxNorm'
			AND CONCEPT_CLASS_ID = 'Ingredient'
			AND INVALID_REASON IS NULL
		)
	LEFT OUTER JOIN CONCEPT c
		ON c.CONCEPT_ID = ca.ANCESTOR_CONCEPT_ID
	/*ROLL UP TO CLINICAL DRUG*/
	LEFT OUTER JOIN CONCEPT_ANCESTOR ca2
		ON ca2.DESCENDANT_CONCEPT_ID = r.DRUG
		AND ca2.ANCESTOR_CONCEPT_ID IN (
			SELECT DISTINCT CONCEPT_ID
			FROM CONCEPT
			WHERE VOCABULARY_ID = 'RxNorm'
			AND CONCEPT_CLASS_ID = 'Clinical Drug'
			AND INVALID_REASON IS NULL
		)
	LEFT OUTER JOIN CONCEPT c2
		ON c2.CONCEPT_ID = ca2.ANCESTOR_CONCEPT_ID;

/*NOW THAT WE HAVE THAT TABLE, LET'S FILTER TO ONLY THE CONDITIONS IN THE HOIs OF INTEREST*/
WITH CTE_HOI AS ( 
	SELECT DISTINCT c.CONCEPT_ID AS HOI_CONCEPT_ID, c.CONCEPT_NAME AS HOI_CONCEPT_NAME, 
		c1.CONCEPT_ID AS CONDITION_CONCEPT_ID, c1.CONCEPT_NAME AS CONDITION_CONCEPT_NAME 
	FROM CONCEPT c JOIN CONCEPT_ANCESTOR ca 
		ON ca.ANCESTOR_CONCEPT_ID = c.CONCEPT_ID 
	JOIN CONCEPT c1 
		ON c1.CONCEPT_ID = ca.DESCENDANT_CONCEPT_ID 
		AND c1.INVALID_REASON IS NULL 
	WHERE c.CONCEPT_ID IN ( /*YOU NEED TO DOUBLE CHECK THAT THESE ARE ALL THE OMOP HOI CONCEPT IDS*/
		500000402,500000404,500000801,500001003,500000308,500000303,500000802,500000805,500000301,500000403,500000405,500000401,500000302,500000305,500001001,500000307,500000803,500000306,500000304 
	) 
)
SELECT h.HOI_CONCEPT_ID, h.HOI_CONCEPT_NAME, i.*
FROM TEMP_DRUG_HOI_EVIDENCE_W_INGREDIENTS i
	JOIN CTE_HOI h
		ON h.CONDITION_CONCEPT_ID = i.CONDITION_CONCEPT_ID
WHERE MODALITY = 't'
ORDER BY HOI_CONCEPT_ID, CONDITION_CONCEPT_NAME, DRUG_CONCEPT_NAME, INGREDIENT_NAME, EVIDENCE_TYPE

Hi Erica,

I had executed your edited query for the 10 HOI concept IDs individually.

However, I got results with no value in for ‘modality’, even though
‘modality=t’ is set.

Still, I was able to filter out the unique drug concept IDs in Excel and R.

My results are shown below. In summary, there is not much of a difference
in the total # unique drug concept IDs from LAERTES test set, whether the
drugs are filtered at the ingredient level. Recall values are quite low -
around 3% for each of the HOI concept IDs.

No filtering at the ingredient level HOI concept ID hoi_concept_name
LAERTES: Total # unique drug concept IDs from test set (Evidence of
association) OMOP reference: Total # unique drug concept IDs from reference
set # Matches Recall 500000301 OMOP Acute Liver Failure 1 4843 117 117
0.024158579 500000302 OMOP Acute Liver Failure 2 4228 2 2 0.000473037
500000307 OMOP Acute Liver Failure 7 984 117 82 0.083333333 500000308 OMOP
Acute Liver Failure 8 4228 117 117 0.027672658 500000401 OMOP Acute Renal
Failure 1 2897 88 84 0.028995513 500000402 OMOP Acute Renal Failure 2 2897
88 84 0.028995513 500000404 OMOP Acute Renal Failure 4 2897 88 84
0.028995513 500000405 OMOP Acute Renal Failure 5 2897 88 84 0.028995513
500000801 OMOP Acute Myocardial Infarction 1 2732 102 99 0.036237189
500001001 OMOP Upper Gastrointestinal Ulcer 1 1863 91 81 0.043478261 Filter
at the ingredient level HOI concept ID hoi_concept_name LAERTES: Total #
unique drug concept IDs from test set (Evidence of association) OMOP
reference: Total # unique drug concept IDs from reference set # Matches
Recall 500000301 OMOP Acute Liver Failure 1 4670 117 116 0.0248394
500000302 OMOP Acute Liver Failure 2 4101 2 2 0.000487686 500000307 OMOP
Acute Liver Failure 7 970 117 82 0.084536082 500000308 OMOP Acute Liver
Failure 8 4101 117 116 0.028285784 500000401 OMOP Acute Renal Failure 1
2819 88 83 0.029443065 500000402 OMOP Acute Renal Failure 2 2819 88 83
0.029443065 500000404 OMOP Acute Renal Failure 4 2819 88 83 0.029443065
500000405 OMOP Acute Renal Failure 5 2819 88 83 0.029443065 500000801 OMOP
Acute Myocardial Infarction 1 2671 102 97 0.036315987 500001001 OMOP Upper
Gastrointestinal Ulcer 1 1828 91 81 0.044310722

Lili,

I think you are reporting low precision (P = TP/(TP + FP) values rather than low recall (R = TP/(TP + FN)). Would you please correct that in your tables and also add a column for the actual Recall as well as the balanced F-measure (2PR/(P+R)) which nicely balances P and R?

As for the low precision you are finding. It does look concerning (after all, there are only 13K ingredients!). However, I walked through the same process for DILI Definition 1 and think I might have identified some a clue. First, are are you using counting the ‘drug_concept_id’ column or the ‘ingredient_id’ column when you count unique drugs? You should be counting the latter. Second, Erica’s query for the HOIs expands the HOI lists using ancestor/descendant relationships. But, the SNOMED concept_ids are already expanded for the OMOP definitions. So, this is probably a better approach because it is more direct and shouldn’t introduce spurious HOIs (btw, I am not seeing the same issue with modality that your are):

  1. create the same TEMP_DRUG_HOI_EVIDENCE_W_INGREDIENTS table, and

  2. run a simple query (you get this kind of formatting using back tics before and after the query):

SELECT COUNT(DISTINCT ingredient_id) FROM TEMP_DRUG_HOI_EVIDENCE_W_INGREDIENTS WHERE CONDITION_CONCEPT_ID IN (448372,206254,205828,449466,206944,...<ALL CONCEPT IDS FROM THE CONCEPT ID COLUMN OF THE DILI DEFINITION> ) AND MODALITY = 't'

You would have to recalculate P, R, and F but the count looks a bit better (~3K ingredients, probably still not very good though for precision). Once we are sure that we are counting the right things, we will need to trace some of the false positives specifically back how it was brought in from sources.

I expect that the main issue is still a big part of the drug class mention problem that we have already discussed. A table we can use to filter out the drug-HOI associations that occur by class mapping is at this link. The table has HOIs as OMOP concept IDs for Mesh because of the source, so will require an additional step to get to SNOMED. I can have my part time analyst work on that. It will reduce the drug-HOIs from Medline to only those with exact drug mentions (rather than class mentions mapped to individual drugs) and will likely show much better precision but lower recall.

The next version of LAERTES (set for 1/30) will try a new approach that I think will greatly improve precision without hurting recall.

@rkboyce:

Not sure I understand the probem, Rich, but putting a manual step into an automated big machine sounds dangerous to me. Why does it introduce “a large number of false positive associations because of the need to go from drug class MeSH tags to individual drugs”? Do you have an example? Is there any heuristic we could invent?

There will be no manual step (and hence no danger :slight_smile: ).

An example would be Hypoglycemic Agents that would be mapped using the
MeSH Pharmacological Action Descriptors file
(http://www.nlm.nih.gov/mesh/pa_abt.html) to bunch of entities. A
MEDLINE abstract might mention a drug-HOI association involving a
specific hypoglycemic agent but use the class name as the MeSH tag. If
we expand to include all mapped agents, we would potentially introduce a
bunch of “evidence for associations” into LAERTES that really are not
justified given the source. However, in developing penelope, we found
that not doing the mapping leads to important “evidence for
associations” being missed. The solution I want to try is simple and
automated ( Test crosswalking MeSH drug class with semmed drug named entity mention to avoid false positives · Issue #74 · OHDSI/KnowledgeBase · GitHub). For cases
where the drug class is MeSH tagged, we will search for drug entities
mentioned in the text that were identified by Metamap (already available
in SemMedDB). If we find specific entities that are in the class, then
that will be “evidence for association” that is loaded into LAERTES.
This should keep the noise down while not missing important associations.

Hi Richard,

I’m still digesting the contents of your response over the holidays.

To address the initial part regarding renaming ‘recall’ to ‘precision’, I
don’t quite understand how ‘precision’ is the best suited to describe the
numbers. For example, for 500000301 OMOP Acute Liver Failure, I only got
116 records from the OMOP reference set to validate the 4670 records (with
unique drug concept IDs) from the LAERTES set, indicating that only 2.48%
of the test set could be validated. What are the true and false positives
in this validation? We had discussed at the last LAERTES community call
that the ‘modality=t’ in the LAERTES set represents the existence of a
record, but the information on whether the drug is positively or negatively
associated with the HOI isn’t available yet. So, we don’t really have true
or false positives to calculate precision. Therefore, the best term I
could apply to the 2.48% for 500000301 OMOP Acute Liver Failure is Recall
(or that the remaining 4554 could not be validated currently).

I had counted by ‘drug_concept_id’, not by ‘ingredient_id’. I will execute
the query again using the latter as well as your suggested query above.

Lili

Perhaps ‘% Representation’ would be a better word to describe the # records
validated?

@lilipeng,

Question for you, why are you taking the number of drugs in the OMOP reference set over the number of drugs found in LAERTES? The OMOP Reference Set is not the universe of drugs that cause an HOI and LAERTES is going to have many drugs that are just noise (i.e. we are pulling in FAERS, there will be reports for many random ingredients causing an HOI).

The OMOP reference set was suggested to me to use as the standard for validating LAERTES dataset. Sara Dempster and I have been working on this for the past few weeks.

Lili, For the LAERTES evidence base, is more about if the drug-HOI record has relevance to the question of if the drug and HOI are associated then if the system is saying there is an association. Since you are using the OMOP set as the reference, any drug-HOI evidence item returned by the LAERTES evidence base that is also in the reference set is a “TP” from an information retrieval perspective, which I think is the right perspective to conduct this analysis. The terms I recommend to you (and how to calculate them) are standard in information retrieval but, as you can see, TP and FP don’t have the exactly the same semantics here as you are probably used to. I hope that helps. If you can make those suggested changes, we can use your figures as a baseline from which to track improvements to the evidence base. I can also help you to refine the analysis to those drug-HOI records in LAERTES that are not the result of mapping from a drug class mention. This will help us see what impact a much more conservative approach to loading evidence has on retrieving evidence on drug-HOI associations of interest.

We have to be very clear about what validation means here. To me, your validation is useful to assessing how complete the LAERTES evidence base is wrt drug-HOI associations that have been established by some reasonable consensus process. What you are finding it seems is that it is relatively complete relative to the OMOP reference set (i.e., high recall) but has a great deal of noisy associations (i.e., low precision) Since the evidence base is nothing more than a collection of evidence, I don’t think we can do much more than this right now. However, that it is still very useful. With the baseline you are establishing, we can quantitatively show the effect of using different strategies to deal with issues during the ETL process.

On the other hand, Erica has worked on machine learning to predict drug-HOI associations using the evidence sources and statistics as features. This is like generating a knowledge base and therefore requires a different approach to validation. One that I think is more what you originally hoped to do.

t