Rules for concept inclusion in concept sets

jswerdel · July 16, 2025, 4:37pm

While exploring the use of LLMs to help create and curate concept sets, we stumbled on an interesting fundamental question: what logic should we (or AI) apply when deciding which concepts belong to a concept set?

Specifically, we have two alternatives:

1. Only include concepts that semantically fall within the main concept set idea. For example, for the concept set on “Nausea” we would include “Postoperative nausea” or “Exacerbation of nausea”. Often these are subtypes of the concept set idea.
2. Also include concepts that practically imply the concept set idea. For example, for “Nausea” we would also include “Vomiting”, as in nearly all cases, vomiting implies nausea.

Option 2 does not necessarily mean lower specificity, because we require that (almost) everyone who has the concept also has the concept set idea (we currently instruct the LLM to use a threshold: at least 95% of people with the concept must have the idea). It does, however, have the potential to raise sensitivity. Option 2 makes the most sense if the goal is to maximize the phenotype’s operating characteristics, but it does feel odd to have concepts in the set that are not subtypes of the main idea.
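To make the two options concrete, here is a minimal sketch of the inclusion rule in Python. The `Candidate` class, the `include` function, and the probability values are all made up for illustration (they are not output from our LLM pipeline or any OHDSI tool), and “Headache” is a hypothetical non-qualifying concept added only for contrast.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    is_subtype: bool  # semantically falls within the concept set idea
    p_idea_given_concept: float  # assumed share of people with the concept who have the idea


def include(candidate: Candidate, option: int, threshold: float = 0.95) -> bool:
    """Decide inclusion under option 1 (subtypes only) or option 2 (subtypes plus implied)."""
    if candidate.is_subtype:
        return True  # both options accept semantic subtypes
    if option == 2:
        # Option 2 also accepts concepts that practically imply the idea.
        return candidate.p_idea_given_concept >= threshold
    return False


# Illustrative values only; the probabilities are assumptions, not measurements.
candidates = [
    Candidate("Postoperative nausea", is_subtype=True, p_idea_given_concept=1.0),
    Candidate("Vomiting", is_subtype=False, p_idea_given_concept=0.97),
    Candidate("Headache", is_subtype=False, p_idea_given_concept=0.30),
]

option_1 = [c.name for c in candidates if include(c, option=1)]
option_2 = [c.name for c in candidates if include(c, option=2)]
print(option_1)
print(option_2)
```

Under these assumed numbers, option 2 adds “Vomiting” to the “Nausea” set while still rejecting the weakly associated concept, which is why sensitivity can rise without much cost to specificity.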

Below are some more examples. What do you think should be our policy for making concept sets?

Primary concept	Possible included concept	Rationale
Bleeding	Open fracture of ulna	Bleeding is a direct and expected consequence of an open fracture of the ulna due to the disruption of blood vessels and exposure of the bone. It is nearly guaranteed to occur in such cases.
Bronchitis	Haemophilus influenzae pneumonia	Haemophilus influenzae pneumonia logically implies bronchitis because the infection directly causes inflammation in the bronchial tubes, which is the defining feature of bronchitis. Therefore, nearly all patients with Haemophilus influenzae pneumonia would also meet the criteria for bronchitis.
Diarrhea	Gastroenteritis	Gastroenteritis is a condition that almost universally includes diarrhea as a symptom. In medical practice, the presence of gastroenteritis strongly implies the presence of diarrhea, as it is one of the defining features of the condition. Therefore, it is logical to conclude that 95% or more of patients with gastroenteritis have diarrhea.
Fatigue	Generalized myasthenia	Fatigue is a nearly universal symptom of Generalized myasthenia due to the disease’s underlying mechanism of impaired neuromuscular transmission. It is logical to conclude that more than 95% of patients with Generalized myasthenia experience fatigue.
Vertigo	Active Ménière’s disease	Vertigo is a nearly universal symptom of Active Ménière’s disease, as it is one of the defining features of the condition. The presence of vertigo is logically implied by the diagnosis of Active Ménière’s disease.

schuemie · July 18, 2025, 6:21am

@aostropolets , @Patrick_Ryan , @Azza_Shoaibi ?

aostropolets · July 29, 2025, 10:20pm

My take on it is that there is a difference between phenotypes for symptoms versus diseases. Symptoms by definition are not fully specified disorders and I would take any disease that is commonly associated with a given symptom as a part of my concept set. Diseases are a different thing.

To illustrate it using the examples above:
I want gastroenteritis in my diarrhea concept set (I’m looking for a symptom that is a manifestation of many diseases)
I do not want Haemophilus influenzae pneumonia in my bronchitis concept set (two distinct diseases)
I don’t know if I want open fracture in my bleeding concept set (as I am probably looking for a specific kind of bleeding: venous/arterial/small vessels/unusual prolonged bleeding from cuts/whatever)

I would hope that a properly constructed clinical definition would guide the selection of concepts in the cases above.

Christian_Reich · August 7, 2025, 11:38am

To add to @aostropolets’s distinction between symptoms and diseases (which is not a sharp line):

You need to look at the use case. What is it you want to study? Is the etiology important or not? Is sensitivity important or not? Is onset important, or are you looking for prevalent problems or even susceptibilities? Unfortunately, there is no general answer to your question.

For example, take your bleeding. If you are studying the amount of blood loss, the cause of the bleeding is irrelevant. If you are studying bleeding as a drug effect, then what you are really after is diminished blood clotting, which leads to bleeding. An open fracture, or any trauma, even though it causes bleeding, is irrelevant.
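A rough sketch of how the use case could drive the decision for the bleeding example, in Python. The `UseCase` enum, the `CAUSED_BY_TRAUMA` annotation, and `relevant_for` are hypothetical names invented for this illustration; a real study would need a proper clinical definition rather than a single flag.

```python
from enum import Enum


class UseCase(Enum):
    BLOOD_LOSS = "amount of blood lost, whatever the cause"
    DRUG_EFFECT = "bleeding as a drug effect (diminished clotting)"


# Hypothetical annotation: does the candidate imply bleeding through trauma?
CAUSED_BY_TRAUMA = {
    "Open fracture of ulna": True,
}


def relevant_for(concept: str, use_case: UseCase) -> bool:
    """Return True if the candidate concept is relevant to the given use case."""
    if use_case is UseCase.BLOOD_LOSS:
        return True  # the cause of the bleeding does not matter here
    # For drug-effect studies, trauma-induced bleeding is not what we are after.
    return not CAUSED_BY_TRAUMA.get(concept, False)


print(relevant_for("Open fracture of ulna", UseCase.BLOOD_LOSS))
print(relevant_for("Open fracture of ulna", UseCase.DRUG_EFFECT))
```

The same candidate concept is kept for one study question and dropped for the other, which is the sense in which there is no general answer.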