OHDSI Home | Forums | Wiki | Github

What is "the magic" in cohort definitions

The ‘magic’ comment in the code is a sort of ‘personal signature’ I left in the code just to make it distinctive.

This question comes up a bit, so I’ll provide some detail here:

The ‘magic’ is the algorithm that can take a set of start and end dates and return which of those dates are the dates when all starts have been ended, accounting for overlapping periods. So if we have 4 date ranges that produce this timeline:

|-----------|
      |---------|
   |--------------|
                         |--------|

This plot might would look like this in the data:

start_date, end_date
2010-01-01,  2010-06-01
2010-03-01,  2010-07-01
2010-02-01,  2010-08-01
2010-11-01, 2010-12-01

We can see easily that there’s 2 ‘eras’ here with a gap, spanning 2010-01-01 to 2010-08-01 and finally 2010-11-01 through 2010-12-01.

The ‘magic’ algorithm arranges the start and end dates in order, such that you have 3 columns: the eventDate, start_ord and overall_ord where start_ord is the ordinal of the start date and overall_ord is the row number of the set of dates. From the above sample data, it looks like this:

event_date, start_ord, overall_ord
2010-01-01, 1, 1
2010-02-01, 2, 2
2010-03-01, 3, 3
2010-06-01, 3, 4
2010-07-01, 3, 5
2010-08-01, 3, 6
2010-11-01, 4, 7
2010-12-01, 4, 8

(see how overall ord is the row number of each date, while start ord only is the row number from the list of start dates).

From this arrangement, you can see that the row that has 2 * start_ord = overall_ord (or, rewritten: 2* start_ord - overall_ord = 0) means that that date is an end date of an era. Using those ‘magic dates’ you can then find the earliest start_date that belongs to the given end date, and those dates are your eras. In this case it would be:

era_start. era_end
2010-01-01, 2010-08-01
2010-11-01, 2010-12-01

It’s like magic!

-Chris

8 Likes

@hripcsa

I have spent an entire weekend learning and admiring this code snippet from @chris_knoll . It is complex, elegant and efficient all at the same time. It’s truly amazing. It deserves to be called magic.

2 Likes

I totally agree with @Gowtham_Rao!! A summer student and I were head-scratching on EXACTLY that part of the cohort definition code for the treatment pathways project. I NEVER, EVER, EVER, EVER, EVER would have figured it out w/o @Chris_Knoll’s posting.

More seriously, Chris – did you figure this out yourself or does this come from some Hogwarts School of Witchcraft and Wizardry book…

4 Likes

Awesome! Thanks so much for the tutorial and very neat.

1 Like

Yes, I implemented the same sort of strategy in the treatment pathways algorithm. When you need to transform a pattern of A-B-B-B-C-B-B-C-C-C-D into A-B-C-B-C-D, this technique can be applied.

The approach was derived from some research I did on solving the problem of finding the end of eras from this content:

-- Adapted from calculating concurrent sessions query found at http://sqlmag.com/t-sql/calculating-concurrent-sessions-part-3
-- Credits to Ben Flanaghan, Arnold Fribble, and R. Barry Young

They were trying to solve the problem of 'how many sessions were concurrently active. I flipped it around to say when are there no sessions active. I think I improved on the work, however, but I’ll let the other readers bet he judge.

I’ve attributed that work in my gist found here:

-Chris

2 Likes

I guess I shouldn’t leave that treatment pathway statment hanging, so let me demonstrate how that works:

Given the pattern:
A-B-B-B-C-B-B-C-C-C-D
and you need to remove repeats, and reduce it to:
A-B-C-B-C-D

Step 1: rownumber the events:

element, rn
A, 1
B, 2
B, 3
B, 4
C, 5
B, 6
B, 7
C, 8
C, 9
C, 10
D, 11

Next, add a rownumber, but partition by element to the result

element, rn, p_rn
A, 1, 1
B, 2, 1
B, 3, 2
B, 4, 3
C, 5, 1
B, 6, 4
B, 7, 5
C, 8, 2
C, 9, 3
C, 10. 4
D, 11, 1

Notice that duplicates get grouped when you subtract rownum - partitionrow:

element, rn, p_rn, rn-p_rn
A, 1, 1, 0
B, 2, 1, 1
B, 3, 2, 1
B, 4, 3, 1
C, 5, 1, 4
B, 6, 4, 2
B, 7, 5, 2
C, 8, 2, 6
C, 9, 3, 6
C, 10. 4, 6
D, 11, 1, 10

Now if you select element, min(rn) FROM … group by element, rn-p_rn you get:

A, 1
B, 2
C, 5
B, 6
C, 8
D, 11

Ordering by the initial RN, you get
A-B-C-B-C-D

-Chris

1 Like

Chris, your implementation here is an application of a sound software engineering principle - build your solution into data, rather than algorithm. Also, it’s a wonderful example of “relational” programming. My hat is off. Note that application of this principle does not prevent hiding the whole thing in algorithm later, as you imply.

Amazing!

@Chris_Knoll

Question, since this “Cohort Collapse Strategy” is in ATLAS, on the implementation screen it looks like it is always trying to apply it with a default gap day of 0.

But then the Print Friendly says:

If it is set to 0 is it selecting the end of the OBSERVATION_PERIOD_END_DATE or still attempting to collapse with a gap size = 0?

Once the individual cohort episodes are identified (starting with initial events, passing through inclusion criteria, and finally applying exit strategy), you may be left with cohort episodes that overlap. Since overlapping episodes is forbidden, we always collapse the individual cohort episodes into non-overlapping episodes. By default, we allow 0 days of gap between cohort episodes. You can adjust it using the ‘collapse gap size’.

I’m not certain. But it looks like the logic is the next:
Firstly you specify Cohort Exit Criteria where cohort_end_date is defined
And only then the logic of collapsing rows is applied.
Thus ‘Cohort Collapse strategy’ doesn’t touch cohort_end_dates itself.
I.e: you can have 2 or more records for subject with different start_dates (as well as end_dates).
If you decided to collapse them - then these 2 rows will transform into 1 with min(cohort_start) and max(cohort_end)
If no - they will both stay as they are.

The collapse strategy language may be missing from the print friendly, but your editor screen capture is about the collapse strategy and your print friendly screen capture is referencing the end date strategy.

End Date Strategy is use to specify when, after the initial event passes inclusion criteria, how long you should consider the event’s episode before the episode ends. By default, if you don’t specify how an initial event should end it’s episode, then the episode will be defined as from the event start to the end of the event’s observation period.

2 Likes

This is correct, the collapse strategy won’t generate any new end dates. But, if there are multiple episodes that overlap, then you can consider the ‘collapsed episode’ to be the MIN(start_date) and MAX(end_date) of all episodes that are overlapping each other.

@Eldar: I just wanted to add: you don’t really have the choice to collapse them. We always do because we must ensure that there are no overlapping episodes in the cohort results. You can decide to change the ‘allowable gap’ between episodes to bring separate episodes together within a gap, but the logic must always remove overlapping episodes.

2 Likes

So now I’m not uncertain. :grinning:
Thank you

1 Like

Thanks @Chris_Knoll - if I read more carefully I would have seen I was messing up “Cohort Collapse Strategy” and “End Date Strategy”. If my eye traveled down only 100 more pixels I would have seen:

@Chris_Knoll could you give me an example of where I would use this feature. I’m sorry, I know you told me before and now I’m blanking. I was thinking COHORT_END_DATE and that is obviously not right. :frowning:

if you wanted to identify episodes of backpain as a moment that you are diagnosed with backpain, and you assume that if it continues, the’ll have a return visit within 30d with backpain. So you would define your intial events as ‘condition occurrence of ‘Backpain’’, with an exit streategy of ‘fixed date of 30d after event date’. This is just saying ‘Backpain lasts 30 days’. Now if they come in < 30 days for another visit with backpain, you’ll extend the backpain episode another 30d.

What if they come in on day 32? It would be a new episode with the default 0d gap. If you set the collapse gap to 14d then backpain episodes within 14d of each other will collapse into a single back pain episode starting with the first overlapping episode and ending at the end of the last episode (30d after the back pain diagnosis).

Once I have that collapsed time of back pain how/where can I use it in ATLAS?

Let’s say there is a subject with multiple drug era’s of chosen drug:
the first era: 2010-01-01 to 2010-05-01
the second one: 2010-05-25 to 2010-06-25

You are taking drug_era_start date as cohort_start_date, limiting the cohort to all events and choosing Cohort Exit Criteria based on eras of persistence exposure.

If Cohort Collapse strategy is not applied (or gap size is less than 24 days) then the cohort will contain 2 rows for this subject with cohort_start and end_dates equal to corresponding drug_era_start and end_dates.

If gap is more - there will be only 1 record:
cohort_start_date = 2010-01-01 cohort_end_date = 2010-06-25

For this case the same result will be returned if Cohort Exit Criteria is not specified (and thus is taken by default as the end of observation period) (unless these drug_era’s are within the same observation_period)

If these 2 drug_eras belong to different observation periods (let’s say from 2009-01-01 to 2010-05-10
and from 2010-05-15 to 2012 - 01 -01) then default Cohort Exit Criteria will still return 2 rows
(they will be 2010-01-01 - 2010-05-10 and 2010-05-25 - 2012-01-01 correspondingly).

And in this case Collapsing strategy (with sufficient gap size) will union it to one row:
2010-01-01 - 2012-01-01
The trick is that Collapsing strategy doesn’t take into account actual observation periods but operates with only pre-generated timelines

Well, the cohort start and end dates can become your time at risk in the Estimation module in Atlas, and it can be your time at risk window inside the incidence rate calculator. If you have an outcome cohort, the outcome will only be identified if it starts within the cohort start and end dates of your Target cohort.

1 Like
t