Roll-up for Table 1 type characterization of conditions: 20 clinical categories, similar to ICD10 chapters

Wondering if a repo or best practices guide ever came out of this discussion? In particular, what is the recommended way to get all patients with a certain condition, where that condition might be a very general one (e.g. cancer), more specific (brain cancer), or even more specific (GBM stage 4)? Thanks!

Did a solution come out of this conversation?

We have a solution for the 20 ICD10-like top disease groups. This creates a statistical category (meaning no double membership, everything adds up to 100%) for each input concept_id:

 with disease as ( -- define disease categories similar to ICD10 Chapters
  select 1 as precedence, 'Blood disease' as disease_category, 440371 as snomed_rollup union
  select 1, 'Blood disease', 443723 union
  select 2, 'Injury and poisoning', 432795 union
  select 2, 'Injury and poisoning', 442562 union
  select 2, 'Injury and poisoning', 444363 union
  select 3, 'Congenital disease', 440508 union
  select 4, 'Pregnancy or childbirth disease', 435875 union
  select 4, 'Pregnancy or childbirth disease', 4088927 union
  select 4, 'Pregnancy or childbirth disease', 4154314 union
  select 4, 'Pregnancy or childbirth disease', 4136529 union
  select 5, 'Perinatal disease', 441406 union
  select 6, 'Infection', 432250 union
  select 7, 'Neoplasm', 4266186 union
  select 8, 'Endocrine or metabolic disease', 31821 union
  select 8, 'Endocrine or metabolic disease', 4090739 union
  select 8, 'Endocrine or metabolic disease', 436670 union
  select 9, 'Mental disease', 432586 union
  select 9, 'Mental disease', 4023059 union
  select 9, 'Mental disease', 4293175 union
  select 10, 'Nerve disease and pain', 376337 union
  select 10, 'Nerve disease and pain', 4011630 union
  select 11, 'Eye disease', 4038502 union
  select 12, 'ENT disease', 4042836 union
  select 13, 'Cardiovascular disease', 134057 union
  select 14, 'Respiratory disease', 320136 union
  select 14, 'Respiratory disease', 4115386 union
  select 15, 'Digestive disease', 4302537 union
  select 16, 'Skin disease', 4028387 union
  select 17, 'Soft tissue or bone disease', 4244662 union
  select 17, 'Soft tissue or bone disease', 433595 union
  select 17, 'Soft tissue or bone disease', 4344497 union
  select 17, 'Soft tissue or bone disease', 40482430 union
  select 17, 'Soft tissue or bone disease', 4027384 union
  select 18, 'Genitourinary disease', 4041285 union
  select 19, 'Iatrogenic condition', 4105886 union
  select 19, 'Iatrogenic condition', 4053838 union
  select 19, 'Iatrogenic condition', 444199 union
  select 20, 'Not categorized', 441840
 )
select distinct -- get the disease category with the lowest (best fitting) precedence, or assign 'Other Condition'
  concept_id, concept_name,
--   first_value(coalesce(disease_id, 0)) over (partition by concept_id order by precedence nulls last) as disease_id, 
  first_value(coalesce(disease_category, 'Other Condition')) over (partition by concept_id order by precedence nulls last) as disease_category
from concept
left join ( -- find the appropriate disease category, if possible
  select descendant_concept_id, snomed_rollup, disease_category, precedence
  from concept_ancestor 
  join disease on ancestor_concept_id=snomed_rollup
) d on descendant_concept_id=concept_id
where concept_id in (4001903, 22856, 40482052, 4275588) -- place here the concept_ids you want to roll up (have to be standard SNOMED)
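
For anyone who wants to see the precedence rule in isolation, here is a minimal Python sketch. It is not the script above, only an illustration of the "lowest precedence wins" idea. The three rollup concepts are taken from the script, but the descendant concept_ids (1001 to 1003), their ancestor links, and the helper name `pick_category` are made up for the example.

```python
# Toy illustration of the "lowest precedence wins" rule.
# Descendant ids and ancestor links are invented; only the three
# rollup concept_ids and their categories come from the SQL script.

# rollup concept_id -> (precedence, disease_category)
ROLLUPS = {
    432250: (6, "Infection"),
    4266186: (7, "Neoplasm"),
    320136: (14, "Respiratory disease"),
}

# Stand-in for concept_ancestor: descendant concept_id -> ancestor concept_ids
TOY_ANCESTORS = {
    1001: {432250, 320136},  # sits under two rollups, e.g. a respiratory infection
    1002: {4266186},         # sits under one rollup
    1003: set(),             # sits under no rollup at all
}


def pick_category(concept_id):
    """Return the one category with the lowest precedence, or 'Other Condition'."""
    hits = [ROLLUPS[a] for a in TOY_ANCESTORS.get(concept_id, set()) if a in ROLLUPS]
    if not hits:
        return "Other Condition"
    # Tuples compare by precedence first, so min() picks the best-fitting category.
    return min(hits)[1]


categories = {cid: pick_category(cid) for cid in TOY_ANCESTORS}
```

Concept 1001 has two candidate categories but ends up in exactly one of them, which is what makes the result a statistical category.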

We were also thinking of a more fine-grained categorizer, with approximately 100 categories. This has not been done yet.

The finer-grained categories would be very useful :) Let me know when you get to it.

When using the above, should we further aggregate just by the disease_name, since there are multiple disease_ids with identical disease_names? Or were these intended to be separate categories? I’m guessing the former, since it adds up to 20: 19 disease_names + other. But I wanted to be sure I am using this correctly. Thanks


You are correct, that is confusing. I changed the script. The disease_name and disease_id sound like they are equivalent, but they are not. The former is the category (does not exist as a concept), and the latter is a rollup concept in SNOMED. Each category can have one or more of them, and the precedence may or may not be the same. Take the script as it stands now.

Yes, it’s only 19. No idea how that happened, I probably merged a category that was hard to tease apart. The “Other” - not sure there are any. Let me find out.

The 100 categories: Happy to create an equivalent one (and put both into a Github repo), but do you know of a good starting point?

Checked out the conditions that are falling through the cracks. It looks like a list that should fall through: complications (without knowing of what), very generic conditions (disorder of body cavity), signs and symptoms, and abnormal measurements. I added a few to the categories and added a “non-categorized” category in the updated script above.

Instead of taking standardized data and organizing it by a non-standard vocabulary hierarchy, could we identify what are the 20-100 clinical ideas that we’d want to have in a “Table 1 type characterization” and phenotype them properly? I thought @agolozar was trying to make progress on that, but don’t know the status of it.

This would definitely be a useful resource. Happy to help

Phenotypes? I am confused. Phenotypes usually are well-defined conditions that you use in lieu of Condition concepts in CONDITION_OCCURRENCE. You spend all that effort to overcome their shortcomings to have the best outcomes for your study.

Here, we are talking categories for co-morbidity reporting of cohorts. Yes, this script does a quick and dirty job, but probably good enough for the task. The list is based on slightly modified ICD10 Chapters.

If you want to do those as phenotype definitions - I don’t even know how you would do that any other way – using the SNOMED hierarchy to combine Condition concepts plus an order of precedence so that each Condition gets placed into one category only. There are tens of thousands of individual conditions. Do you want to run them all through the @Gowtham_Rao program? I am open to ideas.

Happy Saturday night @Christian_Reich

This is some work done there

We are looking for volunteers to take this thru our peer review process

Exactly. These are individual conditions, or small combinations of a few of them. That is probably useful for some questions. But again, there are thousands of such conditions. You will need a ton of volunteers to cover the space.

The categories I created are meant for co-morbidity characterization across all of medicine. The list of 20 is very broad. Hence the idea to go down a level and create a similar mechanism with some more granularity.

It’s a complex problem @Christian_Reich

My concern is the proposal to use it in table 1. I think we need to use cohort definitions after understanding the measurement error. But this is very difficult and labor intensive, as you said …but if the list is finite we can do it.

Your process sounds simple and is implementable… it’s probably good enough to get a rough estimate of characteristics. It has value.

But we have already defined the finite list that we think should go into table 1. We have already phenotyped most of them. It only needs to go thru the peer review process

So this cohort based solution is within reach and can be completed this year …so why not focus our collective energy on that idea?

@Christian_Reich, my proposal is simple. Can we just enumerate a list of the 20 - 100 clinical ideas that we want to have in a Table 1 to characterize ‘all of medicine’, as you describe it? If we had that list of target ideas, then we could work in parallel: consider whether there is an ontology-based solution that provides a sufficient approach, and also just directly phenotype them, as in the other effort that @Gowtham_Rao linked to.

For what it’s worth, I think there’d be A LOT of value in simply enumerating phenotypes that we want. Here, we’re largely talking about common comorbidities, and that’s certainly a good list. Separately, we need to enumerate the list of outcomes that we want to be able to do safety surveillance and comparative effectiveness, including what @hripcsa was proposing with howoften.org. We also would like to have a list of indications that we want to march through, as we’ve done LEGEND for hypertension and T2DM. Separate from that, we need to enumerate the set of covariates/features that we’d want to consider for patient-level predictive modeling, as @jennareps has asked for in the past. Now, these lists will likely have overlaps, but they’ll also likely have items that are more relevant for one use case than another. But if we could define some universe (even if just an initial starting point), then we’d at least have a target to work toward.

As a starting point, I’ll just remind folks about the clinical ideas that we currently include in ‘Table 1’ from the standard output of CohortMethod. @schuemie did a great job of implementing a solution that could have taken any list of concepts (+descendants), but if you gotta beef with the list, that’s 100% on me, because I was the one that picked them, based on reviewing Table 1’s from a collection of published papers and trying to make an intersection list, and then traversing SNOMED (for conditions) and ATC (for drugs) to determine if I could find a concept that was a ‘good enough’ approximation for the clinical idea of interest. We definitely created this list as a starting point strawman and didn’t intend it to become a de facto standard, but we haven’t seen anyone suggest other ideas. And yet, when we go through these clinical concepts, we already know for many of them that using concept+descendant is quite problematic (from Phenotype Phebruary, you can see the issues with ADHD, diabetes, depression), so the list could be improved by replacing the concept-based approach with a proper phenotype. But we can’t phenotype what we don’t define, so maybe getting the community’s input on the list of target ideas would move this conversation forward.

  • Demographics:
    – Age group (5-year buckets)
    – sex
    – (probably should also include race, ethnicity, and index year)
  • Medical history (conditions):
    – Acute respiratory disease
    – Attention deficit hyperactivity disorder
    – Chronic liver disease
    – Chronic obstructive lung disease
    – Crohn’s disease
    – Dementia
    – Depressive disorder
    – Diabetes mellitus
    – Gastroesophageal reflux disease
    – HIV infection
    – Hyperlipidemia
    – Hypertensive disorder
    – Lesion of liver
    – Obesity
    – Osteoarthritis
    – Pneumonia
    – Psoriasis
    – Renal impairment
    – Rheumatoid arthritis
    – Schizophrenia
    – Ulcerative colitis
    – Urinary tract infectious disorder
    – Viral hepatitis C
    – Visual system disorder
  • Medical history (cardiovascular disease):
    – Atrial fibrillation
    – Cerebrovascular disease
    – Coronary arteriosclerosis
    – Heart disease
    – Heart failure
    – Ischemic heart disease
    – Peripheral vascular disease
    – Pulmonary embolism
    – Venous thrombosis
  • Medical history (neoplasms):
    – Hematologic neoplasm
    – Malignant lymphoma
    – Malignant neoplasm of abdomen
    – Malignant neoplastic disease
    – Malignant tumor of breast
    – Malignant tumor of colon
    – Malignant tumor of lung
    – Malignant tumor of urinary bladder
    – Primary malignant neoplasm of prostate
  • Medication use
    – Antibacterials for systemic use
    – Antidepressants
    – Antiepileptics
    – Antiinflammatory and antirheumatic products
    – Antineoplastic agents
    – Antipsoriatics
    – Antithrombotic agents
    – Beta blocking agents
    – Calcium channel blockers
    – Diuretics
    – Drugs for acid related disorders
    – Drugs for obstructive airway diseases
    – Drugs used in diabetes
    – Immunosuppressants
    – Lipid modifying agents
    – Opioids
    – Psycholeptics
    – Psychostimulants, agents used for ADHD and nootropics
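
The "concept (+descendants)" approach behind this list can be sketched roughly as follows. This is a minimal sketch and not the CohortMethod or FeatureExtraction implementation: it runs against an in-memory SQLite database with only the columns needed, the concept_ids (10, 11, 12, 20) are toy values, and `persons_with` is a hypothetical helper. The one thing it relies on from the OMOP CDM is that CONCEPT_ANCESTOR lists every concept as its own ancestor, so a single join covers the concept and all of its descendants. The same query works whether the starting concept is very general or very specific.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    create table concept_ancestor (ancestor_concept_id integer, descendant_concept_id integer);
    create table condition_occurrence (person_id integer, condition_concept_id integer);
    -- Toy hierarchy: 10 is the clinical idea, 11 and 12 are its descendants.
    -- Each concept is also listed as its own ancestor.
    insert into concept_ancestor values (10, 10), (10, 11), (10, 12), (11, 11), (12, 12), (20, 20);
    -- Toy records: person 3 only has an unrelated condition (20).
    insert into condition_occurrence values (1, 11), (2, 12), (2, 10), (3, 20);
""")


def persons_with(concept_id):
    """Persons with at least one record of the concept or any of its descendants."""
    sql = """
        select distinct co.person_id
        from condition_occurrence co
        join concept_ancestor ca
          on ca.descendant_concept_id = co.condition_concept_id
        where ca.ancestor_concept_id = ?
        order by co.person_id
    """
    return [row[0] for row in con.execute(sql, (concept_id,))]


with_idea = persons_with(10)  # persons 1 and 2
```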

Happy to help. What do I have to do?


Looks like I need to back out a bit here from a debate that wasn’t my intention. Or better yet, jump into that one as well. There is nothing better than a 2-front war. :)

Debate #1: Characterize the conditions of a population. That’s what @pandamiao started this debate with. Skip this if you are interested in the Table 1 discussion.

This solves the following problem: you have a bunch of patients and you want to summarize what diseases they have. Since there are many of them, you need categories. The above script puts each Condition concept into one of 20 clinically meaningful statistical categories. The categories are:

  • Blood disease
  • Injury and poisoning
  • Congenital disease
  • Pregnancy or childbirth disease
  • Perinatal disease
  • Infection
  • Neoplasm
  • Endocrine or metabolic disease
  • Mental disease
  • Nerve disease and pain
  • Eye disease
  • ENT disease
  • Cardiovascular disease
  • Respiratory disease
  • Digestive disease
  • Skin disease
  • Soft tissue or bone disease
  • Genitourinary disease
  • Iatrogenic condition
  • Not categorized

These categories are very similar, but not exactly the same as the 22 ICD-10 Chapters. The differences are threefold:

If you want to use the script, join the concept_id of the main select to the condition_concept_id of your CONDITION_OCCURRENCE table and count up the occurrences of each disease_category. The total will be equal to the total rows in the table, so the percentages add up to 100.
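
A minimal sketch of that join and count, using an in-memory SQLite database so it runs on its own. The table name `concept_category` is an assumption: it stands for wherever you store the output of the script (concept_id plus disease_category). The concept_ids and records are toy data.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Hypothetical table holding the output of the categorization script
    create table concept_category (concept_id integer primary key, disease_category text);
    -- Only the two CONDITION_OCCURRENCE columns needed here
    create table condition_occurrence (person_id integer, condition_concept_id integer);
    insert into concept_category values
        (1001, 'Infection'), (1002, 'Neoplasm'), (1003, 'Other Condition');
    insert into condition_occurrence values (1, 1001), (1, 1002), (2, 1001), (3, 1003);
""")

# One row per category: number of condition records and share of all records
rows = con.execute("""
    select c.disease_category,
           count(*) as n_rows,
           100.0 * count(*) / (select count(*) from condition_occurrence) as pct
    from condition_occurrence co
    join concept_category c on c.concept_id = co.condition_concept_id
    group by c.disease_category
    order by c.disease_category
""").fetchall()

total_pct = sum(pct for _, _, pct in rows)
```

Note that the percentages only reach 100 if every condition_concept_id in the table has a row in the category table. With the inner join shown here, records with a concept that was not run through the script (for example unmapped records with concept_id 0) would be dropped from the counts.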

The next step is to build a similar script, but with about 100 finer grained categories. Otherwise same idea.

Thanks @Christian_Reich , this is really helpful. If one is looking to group conditions into very high-level categories, then the 20 you list here make sense and I agree with you that an ontology-based approach is the preferred approach, because these aren’t really distinct, well-defined, clinical ideas, but rather broad buckets to classify conditions (and there’s nothing inherently ‘right’ or ‘wrong’ about the classification, it is what it is). I don’t think that we could or should try to phenotype broad-based categories like this.

Now, when we start going the next level down, from 20 → 100, I think it’ll be interesting to figure out if the ontology is sufficient or if that’s when we cross into phenotyping range. But, we need to look at what those 100 ideas are to make that determination.

Thank you, @Christian_Reich and @Patrick_Ryan for describing your perspectives. I was also confused when the idea of phenotyping was brought up when I think what @Christian_Reich was trying to get to was a category of disease system and not a set of clinical ideas that would populate a Table 1 report.

But, we can probably merge these ideas: We like to group conditions into categories because it makes a higher-level description of a population easier to grasp when you talk in terms of the different categories of disease in it vs the low-level clinical ideas. But, we also like to accurately identify the people in the population who have the disease in question, so we need phenotypes. So, if we can define phenotypes for as many clinical ideas as possible, and then define a categorization scheme to group these clinical ideas into higher-level categories, then a Table 1 can be at the category level, which identifies the population that has any of the clinical ideas (phenotypes) in the given category.
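
To make the merged idea concrete, here is a rough Python sketch. Everything in it is hypothetical: the cohort memberships, the phenotype-to-category mapping, and the population are invented, and no existing OHDSI tool is implied. It only shows the mechanics of "a person is in a category if they are in any phenotype cohort of that category".

```python
# Hypothetical phenotype cohorts: phenotype name -> set of person_ids
cohorts = {
    "Heart failure": {1, 2},
    "Atrial fibrillation": {2, 3},
    "Pneumonia": {4},
}

# Hypothetical categorization scheme: phenotype name -> higher-level category
category_of = {
    "Heart failure": "Cardiovascular disease",
    "Atrial fibrillation": "Cardiovascular disease",
    "Pneumonia": "Respiratory disease",
}

# The population being characterized (person 5 has none of the phenotypes)
population = {1, 2, 3, 4, 5}

# Union the cohort members per category, restricted to the population
members = {}
for phenotype, persons in cohorts.items():
    members.setdefault(category_of[phenotype], set()).update(persons & population)

# category -> (number of persons, percent of population)
table1 = {
    category: (len(persons), 100.0 * len(persons) / len(population))
    for category, persons in members.items()
}
```

Unlike the row-level statistical categories of the script, these person-level percentages do not have to add up to 100, because one person can fall into several categories or into none.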

Debate #2: Build Table 1.

Patrick questioned whether this categorization script can be used to build a Table 1, or whether you had better use phenotypes for that. Even though that was not the initial purpose of the script, it actually is a good question.

Let’s continue this debate in the following new Forum post, otherwise we will get totally confused.

Agree. Let me come up with some choices. We could cut up the 20, each into 5ish, or we go data driven and balance them a little better so each category is of similar size, or we use ICD10 or another vocab as an example.

That’s probably what we will end up with. Some generic categories, and some detailed phenotypes.
