
Finding Broad Concepts

Looking to see if anyone has some thoughts on this topic . . . .

@Christian_Reich, @Dymshyts, @aostropolets and I have been toying with the idea of flagging condition concepts that are too broad for analytical purposes. For example:

4272240	/*Malaise*/
4309912	/*Generally unwell*/
443403	/*Sequela*/

This interests me because I’m trying to generate lists of negative controls, and a concept like “sequela” is too broad to make a good negative control. We want more specific things, like “lymphoma, b-cell”.

Right now I’ve written a basic query that pulls “too broad” concepts: those that either have a person count of at least 1,000,000 in a large claims database or contain certain key phrases like “finding” or “by site”, plus a list of concepts added by hand. But any other thoughts are welcome.

SELECT *
FROM @conceptUniverseData c1
WHERE PERSON_COUNT >= 1000000
OR UPPER(c1.CONCEPT_NAME) LIKE '%FINDING%'
OR UPPER(c1.CONCEPT_NAME) LIKE '%DISORDER OF%'
OR UPPER(c1.CONCEPT_NAME) LIKE '%INJURY%'
OR UPPER(c1.CONCEPT_NAME) LIKE '%DEAD%'
OR UPPER(c1.CONCEPT_NAME) LIKE '%SYMPTOMS%'
OR UPPER(c1.CONCEPT_NAME) LIKE '%DISEASE OF%'
OR UPPER(c1.CONCEPT_NAME) LIKE '%BY SITE'
OR UPPER(c1.CONCEPT_NAME) LIKE '%BY BODY SITE'
OR UPPER(c1.CONCEPT_NAME) LIKE '%BY MECHANISM'
OR UPPER(c1.CONCEPT_NAME) LIKE '%OF BODY REGION%'
OR UPPER(c1.CONCEPT_NAME) LIKE '%OF SPECIFIC BODY STRUCTURE%'
OR c1.CONCEPT_ID IN (
/*ADDED BY HAND*/
    313878,	/*Respiratory symptom*/
    198194,	/*Female genital organ symptoms*/
    135033,	/*Hair and hair follicle diseases*/
    443949,	/*Disease type AND/OR category unknown*/
    4272240,	/*Malaise*/
    4309912,	/*Generally unwell*/
    4047120,	/*Disorders of attention and motor control*/
    436222,	/*Altered mental status*/
    443403,	/*Sequela*/
    77673,	/*Sign or symptom of the urinary system*/
    4164707,	/*Canceled operative procedure*/
    40490404,	/*Adverse reaction to biological substance*/
    437758,	/*Dependence on enabling machine or device*/
    4106092,	/*Carrier of disorder*/
    4036154,	/*Comfort alteration*/
    433600,	/*Problem, abnormal test*/
    40492403,	/*Superficial foreign body*/
    433656,	/*Abnormal patient reaction*/
    4192174,	/*Illness*/
    4022204,	/*Effect of foreign body*/
    4031958,	/*Trace element excess*/
    4221798,	/*Allergic disorder by allergen type*/
    440005, /*Complication of medical care*/
    4201705,	/*Sequela of disorder*/
    4102111,	/*Mass of body structure*/
    4339468,	/*Ear, nose and throat disorder*/
    4208786,	/*Musculoskeletal and connective tissue disorder*/
    4266186,	/*Neoplasm and/or hamartoma*/
    4134440,	/*Visual system disorder*/
    252662,	/*Tracheobronchial disorder*/
    432586,	/*Mental disorder*/
    4160062,	/*Disorder characterized by pain*/
    4090739,	/*Nutritional disorder*/
    4180154,	/*Female reproductive system disorder*/
    442019,	/*Complication of procedure*/
    4113999,	/*Mass of body region*/
    438112,	/*Neoplastic disease*/
    4178431,	/*Cartilage disorder*/
    40481517,	/*Mass of soft tissue*/
    444208,	/*Chronic inflammatory disorder*/
    435227,	/*Nutritional deficiency disorder*/
    4028244,	/*Chronic disease of cardiovascular system*/
    4288734,	/*Bronchiolar disease*/
    4134595,	/*Chronic disease of genitourinary system*/
    4024558,	/*Disorder associated with menstruation AND/OR menopause*/
    4018852,	/*Acute genitourinary disorder*/
    440059,	/*Recurrent disease*/
    4134593,	/*Chronic digestive system disorder*/
    4168498,	/*Deformity*/
    4022830,	/*General problem AND/OR complaint*/
    4116964,	/*Mass of musculoskeletal structure*/
    4180645,	/*Connective tissue disorder by body site*/
    4134596,	/*Chronic mental disorder*/
    4145825,	/*Anorectal disorder*/
    378444,	/*Hearing disorder*/
    45772120,	/*Gastroduodenal disorder*/
    4115105,	/*Mass of respiratory structure*/
    436677,	/*Adjustment disorder*/
    4051956,	/*Vulvovaginal disease*/
    432250,	/*Disorder due to infection*/
    4134294,	/*Acute inflammatory disease*/
    43021226,	/*Hypersensitivity condition*/
    4168335,	/*Wound*/
    4028367,	/*Acute disease of cardiovascular system*/
    40488439,	/*Abnormality of systemic vein*/
    4239975,	/*Myocardial disease*/
    4338120,	/*Altered bowel function*/
    376961,	/*Disturbance of consciousness*/
    434621,	/*Autoimmune disease*/
    443240,	/*Collapse*/
    4167096,	/*Bone inflammatory disease*/
    444201,	/*Post-infectious disorder*/
    432585,	/*Blood coagulation disorder*/
    135526,	/*Spinal cord disease*/
    4116208,	/*Choroidal and/or chorioretinal disorder*/
    444199,	/*Iatrogenic disorder*/
    4181217,	/*Sequelae of disorders classified by disorder-system*/
    4206460	/*Problem*/
)
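
For anyone who wants to try the logic without access to a claims database, here is a minimal sketch of the same three rules (count threshold, key phrase, hand-picked list) run against a toy table in SQLite. The table name `concept_universe`, the person counts, and the concept IDs 1001 to 1003 are made up for illustration; only the Malaise and Sequela IDs come from the post above.

```python
import sqlite3

# Toy stand-in for @conceptUniverseData. All person counts are invented,
# and concept_ids 1001-1003 are hypothetical placeholders.
rows = [
    (4272240, "Malaise", 2500000),
    (443403, "Sequela", 40000),
    (1001, "B-cell lymphoma", 12000),
    (1002, "Finding of head region", 300),
    (1003, "Fracture of femur", 90000),
]

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE concept_universe "
    "(concept_id INTEGER, concept_name TEXT, person_count INTEGER)"
)
con.executemany("INSERT INTO concept_universe VALUES (?, ?, ?)", rows)

# Flag a concept as too broad if it is very common, matches a key phrase,
# or is on the hand-curated list; also report which rule fired first.
broad = con.execute(
    """
    SELECT concept_id,
           CASE
             WHEN person_count >= :threshold THEN 'count'
             WHEN UPPER(concept_name) LIKE '%FINDING%' THEN 'phrase'
             ELSE 'hand list'
           END AS reason
    FROM concept_universe
    WHERE person_count >= :threshold
       OR UPPER(concept_name) LIKE '%FINDING%'
       OR concept_id IN (443403)
    ORDER BY concept_id
    """,
    {"threshold": 1000000},
).fetchall()

for concept_id, reason in broad:
    print(concept_id, reason)
```

Reporting the reason alongside each flagged concept makes it easier to review which rule is doing the work and which concepts only survive because of the hand list.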

Hi @ericaVoss !

I’m not sure the goal is clear to me: in my mind, the list of negative controls should be specific to each study.
But anyway, you can try using concept_ancestor for this purpose.
The first idea is to exclude concepts with no ancestors, and then also try a threshold on min_levels_of_separation
(as far as I know, this principle is used for grouping covariates in FeatureExtraction).
Or just include the concepts at the lowest position in the hierarchy (those without descendants).
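
The three ideas above could be sketched roughly as follows. This is only an illustration on a toy hierarchy in SQLite, not how FeatureExtraction does it; the concept IDs and the depth threshold of 2 are made up. It assumes, as in the OMOP concept_ancestor table, that each concept is also recorded as its own ancestor with min_levels_of_separation = 0.

```python
import sqlite3

# Toy hierarchy (all ids hypothetical): 1 -> 2 -> {3, 4} and 1 -> 5.
# Each concept also appears as its own ancestor at 0 levels of separation.
rows = [
    (1, 1, 0), (2, 2, 0), (3, 3, 0), (4, 4, 0), (5, 5, 0),
    (1, 2, 1), (1, 5, 1), (2, 3, 1), (2, 4, 1),
    (1, 3, 2), (1, 4, 2),
]

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE concept_ancestor (ancestor_concept_id INTEGER, "
    "descendant_concept_id INTEGER, min_levels_of_separation INTEGER)"
)
con.executemany("INSERT INTO concept_ancestor VALUES (?, ?, ?)", rows)


def ids(sql, params=()):
    """Run a query and return the first column as a list."""
    return [r[0] for r in con.execute(sql, params)]


# Idea 1: concepts with no ancestor other than themselves (candidates to exclude).
roots = ids(
    """
    SELECT descendant_concept_id FROM concept_ancestor
    GROUP BY descendant_concept_id HAVING COUNT(*) = 1
    ORDER BY descendant_concept_id
    """
)

# Idea 2: keep concepts that sit at least `min_depth` levels below their
# most distant ancestor.
min_depth = 2
deep_enough = ids(
    """
    SELECT descendant_concept_id FROM concept_ancestor
    GROUP BY descendant_concept_id
    HAVING MAX(min_levels_of_separation) >= ?
    ORDER BY descendant_concept_id
    """,
    (min_depth,),
)

# Idea 3: concepts with no descendant other than themselves (leaves).
leaves = ids(
    """
    SELECT ancestor_concept_id FROM concept_ancestor
    GROUP BY ancestor_concept_id HAVING COUNT(*) = 1
    ORDER BY ancestor_concept_id
    """
)

print("roots:", roots)
print("deep enough:", deep_enough)
print("leaves:", leaves)
```

In this toy data concept 5 is a leaf but sits only one level below the root, which shows why a depth threshold and a leaf filter can disagree when branches have different depths.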

Hi, @Eldar,
I think using the lowest position in the hierarchy will be too narrow. I agree with you that just using the concept_ancestor for this should be enough (you shouldn’t need to go to the actual claims data to look for what exposures have been mapped to determine broadness or narrowness, although maybe it would serve as a heuristic).

I don’t have a specific solution, but I do recall that some branches of the hierarchy go very deep while others are very shallow. If you wrote a query like this:

select concept_name, ancestor_concept_id, count(*)
from concept_ancestor ca
JOIN concept c on ca.ancestor_concept_id = c.concept_id
WHERE c.domain_id = 'CONDITION'
group  by concept_name, ancestor_concept_id
having count(*) < 10 and count(*) >= 5
order by count(*) desc

I just ran this on our internal vocabulary, and it found 6334 things that are “not too broad” (i.e., < 10 descendants) and not too narrow (i.e., there are at least five descendants at this level). Maybe if you wrap this query in a CTE, you could join it to your claims data and find out which diagnoses roll up to one of those higher-level groups.
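
A rough sketch of the CTE idea, run in SQLite on toy data: the mid-level groups are picked by descendant count, then condition records are rolled up to them through concept_ancestor. All concept IDs, person IDs, and records are invented, the thresholds are shrunk to 3 and 5 so the toy hierarchy stays small, and the join to the concept table (domain filter) is left out for brevity.

```python
import sqlite3

# Toy hierarchy (hypothetical ids): 1 -> {10, 20}, 10 -> {11, 12}, 20 -> {21}.
# Self rows are included, as in the OMOP concept_ancestor table.
ancestor_rows = [
    (1, 1), (10, 10), (11, 11), (12, 12), (20, 20), (21, 21),
    (1, 10), (1, 11), (1, 12), (1, 20), (1, 21),
    (10, 11), (10, 12), (20, 21),
]
# Toy claims records: (person_id, condition_concept_id).
condition_rows = [(1, 11), (2, 12), (3, 21), (1, 12)]

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE concept_ancestor "
    "(ancestor_concept_id INTEGER, descendant_concept_id INTEGER)"
)
con.execute(
    "CREATE TABLE condition_occurrence "
    "(person_id INTEGER, condition_concept_id INTEGER)"
)
con.executemany("INSERT INTO concept_ancestor VALUES (?, ?)", ancestor_rows)
con.executemany("INSERT INTO condition_occurrence VALUES (?, ?)", condition_rows)

# Toy thresholds; the query in the thread used >= 5 and < 10.
params = {"lo": 3, "hi": 5}

rollup = con.execute(
    """
    WITH mid_level AS (
        -- ancestors that are neither too broad nor too narrow
        SELECT ancestor_concept_id
        FROM concept_ancestor
        GROUP BY ancestor_concept_id
        HAVING COUNT(*) >= :lo AND COUNT(*) < :hi
    )
    SELECT m.ancestor_concept_id AS group_concept_id,
           COUNT(DISTINCT co.person_id) AS person_count
    FROM condition_occurrence co
    JOIN concept_ancestor ca
      ON ca.descendant_concept_id = co.condition_concept_id
    JOIN mid_level m
      ON m.ancestor_concept_id = ca.ancestor_concept_id
    GROUP BY m.ancestor_concept_id
    ORDER BY m.ancestor_concept_id
    """,
    params,
).fetchall()

print(rollup)
```

Note that COUNT(*) here includes each concept's row for itself, so the thresholds are off by one relative to a strict count of descendants.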

We have done some work on “preferred ancestor” with NIH Research Entity Dictionary.

Can you say more about how you construct your table “@conceptUniverseData”? E.g., the SNOMED root concept will clearly have too much data.

We have simply defined tiers of ancestor concepts by # of concepts or # of patients (or rows) and used the best tier for a given purpose.

Thanks everyone!

@Eldar we were thinking that way too, to use levels of separation; however, it gets messy because the SNOMED hierarchy isn’t like MedDRA: one hop in one path can get you something very specific, whereas in another path you may need to take many hops to reach a specific code. It is hard to “shave off the top” of “too broad” concepts in conditions.

@Chris_Knoll thanks for this! I’m going to play around with this idea and see if I can incorporate it.

@Vojtech_Huser I’d be interested in understanding more because I’m not 100% clear from this description.
For @conceptUniverseData: let’s say that for a given drug exposure you are looking for condition negative controls. I use a claims dataset to find people who are exposed to the drug of interest and set my starting “universe” to conditions that occur any time after the exposure. This helps me narrow what I’m looking at; there is no point in reviewing a negative control that your cohort of exposed people never goes on to have.
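
As a rough illustration of that construction (not the actual code behind @conceptUniverseData), here is a sketch in SQLite with CDM-style table and column names. It assumes “after the exposure” means after each person’s first exposure to the drug; the concept IDs, person IDs, and dates are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE drug_exposure "
    "(person_id INTEGER, drug_concept_id INTEGER, drug_exposure_start_date TEXT)"
)
con.execute(
    "CREATE TABLE condition_occurrence "
    "(person_id INTEGER, condition_concept_id INTEGER, condition_start_date TEXT)"
)

# Hypothetical data; ISO dates compare correctly as text in SQLite.
con.executemany(
    "INSERT INTO drug_exposure VALUES (?, ?, ?)",
    [
        (1, 99, "2015-01-10"),
        (1, 99, "2015-06-01"),
        (2, 99, "2016-03-01"),
        (3, 77, "2015-01-01"),  # exposed to a different drug
    ],
)
con.executemany(
    "INSERT INTO condition_occurrence VALUES (?, ?, ?)",
    [
        (1, 500, "2015-02-01"),  # after first exposure
        (1, 600, "2014-12-01"),  # before first exposure, ignored
        (2, 500, "2016-04-01"),
        (2, 600, "2016-05-01"),
        (3, 500, "2015-02-01"),  # person not exposed to drug 99
    ],
)

drug_of_interest = 99  # hypothetical drug concept_id

universe = con.execute(
    """
    WITH first_exposure AS (
        SELECT person_id, MIN(drug_exposure_start_date) AS index_date
        FROM drug_exposure
        WHERE drug_concept_id = ?
        GROUP BY person_id
    )
    SELECT co.condition_concept_id,
           COUNT(DISTINCT co.person_id) AS person_count
    FROM condition_occurrence co
    JOIN first_exposure fe ON fe.person_id = co.person_id
    WHERE co.condition_start_date > fe.index_date
    GROUP BY co.condition_concept_id
    ORDER BY co.condition_concept_id
    """,
    (drug_of_interest,),
).fetchall()

print(universe)
```

The resulting person_count per condition is the kind of column a PERSON_COUNT threshold could then be applied to.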

I played around a bit with @Chris_Knoll’s query and I think I’ve landed on this to find “too broad” concepts:

select concept_name, ancestor_concept_id AS CONCEPT_ID, count(*) AS RELATIONSHIP_COUNT
from concept_ancestor ca
	JOIN concept c 
		on ca.ancestor_concept_id = c.concept_id
		AND c.STANDARD_CONCEPT = 'S'
WHERE c.domain_id = 'CONDITION'
group  by concept_name, ancestor_concept_id
having count(*) > 45

My query and Chris’s have a lot of agreement, but this definitely pulls in a lot more vague concepts like “Soft tissue lesion” and “Inflammatory disorder”.

This doesn’t capture the “Injury of” or “Finding” concepts, so I’ll just append Chris’ query to mine.

https://github.com/OHDSI/CommonEvidenceModel/blob/master/postProcessingNegativeControls/inst/sql/sql_server/broadConcepts.sql#L19

@rkboyce mentioned he may have some additional ideas.

Friends:

Should we bring this up with SNOMED? I am sure they have had that problem in one form or another as well.

@Christian_Reich - yeah if you think they would be willing to help.

@rkboyce taught me how to get top relationships in SNOMED, so I’ve added that:
https://github.com/OHDSI/CommonEvidenceModel/blob/master/postProcessingNegativeControlsPrep/inst/sql/sql_server/broadConcepts.sql#L37-L42

This helped with the broad drugs:

t