
What is the difference between 'using all' and 'using distinct' in the cohort definition?


(panpan) #1

Dear all,
When I use ATLAS to build a cohort, I can’t understand the difference between ‘using distinct’ and ‘using all’, as shown in the two pictures. What is it?
Thank you very much!



(Eldar Allakhverdiiev) #2

Hi @pandamiao,

Let’s assume we have a concept set with more than one concept_id, and we have 2 patients:
Person A has 3 occurrences, all of the same concept_id.
Person B has 2 occurrences, one for each of two concept_ids.
When ‘using all’ is chosen, it does not matter how many distinct concepts are found: every occurrence counts. So there are 3 ‘all’ events for person A and 2 events for person B.
On the other hand, when ‘using distinct’ is chosen, the cohort builder counts how many distinct concepts are found per person.
In this case, we have just 1 ‘distinct’ event for person A, and still 2 events for person B.
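
The same Person A / Person B example can be sketched in a few lines of Python. This is only an illustration of the counting logic, not how ATLAS implements it: the concept IDs (101, 202) and the `count_events` helper are made up.

```python
from collections import defaultdict

# Hypothetical event rows: (person, concept_id). The concept IDs are made up.
events = [
    ("A", 101), ("A", 101), ("A", 101),  # Person A: 3 occurrences of one concept
    ("B", 101), ("B", 202),              # Person B: 1 occurrence of each concept
]

def count_events(events, distinct=False):
    """Count events per person.

    distinct=False mimics 'using all' (every row counts);
    distinct=True mimics 'using distinct' (unique concept_ids count).
    """
    per_person = defaultdict(list)
    for person, concept_id in events:
        per_person[person].append(concept_id)
    return {
        person: len(set(ids)) if distinct else len(ids)
        for person, ids in per_person.items()
    }

using_all = count_events(events)                      # A has 3, B has 2
using_distinct = count_events(events, distinct=True)  # A has 1, B has 2
```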

As for use cases: it’s often necessary to identify treatment-resistant patients, where resistance is defined as using >N different drugs from a certain drug group. Let’s say we would like to find patients who used > 3 antidepressants within a year.
One way: create a separate concept set for each active ingredient, build a Drug Exposure criterion for each of them, and combine them into a group with the logic ‘having at least 3 of the following criteria’.

That would take a day to create the concept sets and one more day to build the cohort definition. A waste of time and coffee :sweat_smile:

To avoid this, we can create just one concept set that includes all the ingredients and use it in a Drug Era criterion with ‘using distinct’ chosen.

Seems much more efficient.
If dosage also matters, we can update the cohort to capture different dosages of the drugs and use ‘Dose Era’.
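
As a rough sketch of the single-concept-set approach: collect the distinct ingredients each person has eras for within the window, then apply the threshold. Everything here is hypothetical (the ingredient labels, the dates, the helper name, and the `min_distinct` value), and it only mimics the idea of ‘using distinct’ on drug eras, not the SQL that ATLAS generates.

```python
from datetime import date

# Hypothetical drug eras: (person, ingredient, era start date). All values are made up.
drug_eras = [
    ("A", "ingredient_1", date(2020, 1, 10)),
    ("A", "ingredient_1", date(2020, 3, 5)),
    ("A", "ingredient_1", date(2020, 6, 1)),
    ("B", "ingredient_1", date(2020, 1, 15)),
    ("B", "ingredient_2", date(2020, 4, 20)),
    ("B", "ingredient_3", date(2020, 9, 2)),
]

# One concept set holding every ingredient of the drug group.
concept_set = {"ingredient_1", "ingredient_2", "ingredient_3", "ingredient_4"}

def persons_with_distinct_ingredients(eras, concept_set, start, end, min_distinct):
    """Return persons with at least `min_distinct` different ingredients in [start, end]."""
    seen = {}
    for person, ingredient, era_start in eras:
        if ingredient in concept_set and start <= era_start <= end:
            seen.setdefault(person, set()).add(ingredient)
    return sorted(p for p, ingredients in seen.items() if len(ingredients) >= min_distinct)

# min_distinct=3 is just an example threshold.
matches = persons_with_distinct_ingredients(
    drug_eras, concept_set, date(2020, 1, 1), date(2020, 12, 31), min_distinct=3
)
```

Person A has three eras but only one ingredient, so only person B passes the distinct threshold. Counting all eras instead would let both through.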


(Chris Knoll) #3

Eldar has explained it well. Thanks @Eldar.

At a more basic level: you have 2 choices for ‘counting things’ when you have a ‘window criteria’ (the type of criteria where you say ‘have at least 1 occurrence of {Condition} between 30 days before and 0 days before index’).

The first choice is simply to count all the observed events. This is the ‘using all’ option next to ‘with at least N’…

The other choice is to count distinct ‘concept_ids’ from the observed events. This is the ‘using distinct’ option. Each cohort criteria type uses a specific field in its CDM domain table as the distinct value: condition uses ‘condition_concept_id’, drug exposure uses ‘drug_concept_id’, and visits use ‘visit_concept_id’. I tried to make it consistent, but it doesn’t always satisfy all needs. Additional ‘distinct options’ that I’d like to put in:

  • Distinct visits (distinct visit_occurrence_id)
  • Distinct dates (distinct {domain}_start_date, i.e. distinct condition_start_date)
  • Distinct visit dates (distinct visit_start_date of the associated visit)

That will make it easier to do things like ‘at least 3 occurrences on different dates in the past 6 months’.

It’s on the roadmap…but I think this would be useful.
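
To illustrate how the choice of ‘distinct’ field changes the count, here is a small Python sketch. The field names follow the CDM columns mentioned above, but the records, the `ConditionRecord` type and the `count_occurrences` helper are all made up, and the visit/date options are the proposed ones, not something this sketch claims ATLAS supports.

```python
from datetime import date
from typing import NamedTuple, Optional

class ConditionRecord(NamedTuple):
    condition_concept_id: int
    visit_occurrence_id: int
    condition_start_date: date
    visit_start_date: date

# Hypothetical records for one person; the first two look like same-day duplicates.
records = [
    ConditionRecord(101, 1, date(2021, 1, 5), date(2021, 1, 5)),
    ConditionRecord(101, 1, date(2021, 1, 5), date(2021, 1, 5)),
    ConditionRecord(101, 2, date(2021, 2, 10), date(2021, 2, 9)),
    ConditionRecord(202, 2, date(2021, 2, 11), date(2021, 2, 9)),
]

def count_occurrences(records, distinct_field: Optional[str] = None) -> int:
    """Count all records, or the distinct values of one field if given."""
    if distinct_field is None:
        return len(records)  # 'using all'
    return len({getattr(r, distinct_field) for r in records})

counts = {
    "all": count_occurrences(records),
    "concept": count_occurrences(records, "condition_concept_id"),
    "visit": count_occurrences(records, "visit_occurrence_id"),
    "start_date": count_occurrences(records, "condition_start_date"),
    "visit_date": count_occurrences(records, "visit_start_date"),
}

# 'At least 3 occurrences on different dates' would use the start_date count.
at_least_3_dates = counts["start_date"] >= 3
```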


(Mark Danese) #4

@Chris_Knoll the “on distinct dates” part can be important, so I’m glad to see it is on the roadmap. Duplicates (or things that look like duplicates) can be relatively common, depending on how the ETL is done. In claims data, for example, one can get two records for an infused medication on the same day: one with a modifier and one without. For example, there is a modifier for wastage, so one might see one record for the amount used and a second record for the amount wasted, with the quantities of the two records summing to the amount billed/reimbursed. It happens a lot with chemotherapy, but it depends on whether wastage is required to be reported. (We just went through this exercise as part of putting together chemotherapy regimens.)


(panpan) #5

:grin: Thank you very much @Eldar, got it!


(panpan) #6

Yeah, thank you very much @Chris_Knoll. I get what you said, and I also think it would be useful.

