What is "the magic" in cohort definitions

hripcsa · June 8, 2018, 12:34pm

Sorry for the naive question, but I was asked the following and cannot answer it. Atlas produces SQL code from the GUI-based cohort definition. There is a section labeled “the magic” in the SQL output. What does that section do? It appears to be part of assembling the final answer.

Eldar · June 8, 2018, 1:20pm

Short answer is : this is the part of script that is responsible for Cohort Collapse Strategy

Chris_Knoll · June 8, 2018, 1:46pm

The ‘magic’ comment in the code is a sort of ‘personal signature’ I left in the code just to make it distinctive.

This question comes up a bit, so I’ll provide some detail here:

The ‘magic’ is the algorithm that can take a set of start and end dates and return which of those dates are the dates when all starts have been ended, accounting for overlapping periods. So if we have 4 date ranges that produce this timeline:

|-----------|
      |---------|
   |--------------|
                         |--------|

This plot might would look like this in the data:

start_date, end_date
2010-01-01,  2010-06-01
2010-03-01,  2010-07-01
2010-02-01,  2010-08-01
2010-11-01, 2010-12-01

We can see easily that there’s 2 ‘eras’ here with a gap, spanning 2010-01-01 to 2010-08-01 and finally 2010-11-01 through 2010-12-01.

The ‘magic’ algorithm arranges the start and end dates in order, such that you have 3 columns: the eventDate, start_ord and overall_ord where start_ord is the ordinal of the start date and overall_ord is the row number of the set of dates. From the above sample data, it looks like this:

event_date, start_ord, overall_ord
2010-01-01, 1, 1
2010-02-01, 2, 2
2010-03-01, 3, 3
2010-06-01, 3, 4
2010-07-01, 3, 5
2010-08-01, 3, 6
2010-11-01, 4, 7
2010-12-01, 4, 8

(see how overall ord is the row number of each date, while start ord only is the row number from the list of start dates).

From this arrangement, you can see that the row that has 2 * start_ord = overall_ord (or, rewritten: 2* start_ord - overall_ord = 0) means that that date is an end date of an era. Using those ‘magic dates’ you can then find the earliest start_date that belongs to the given end date, and those dates are your eras. In this case it would be:

era_start. era_end
2010-01-01, 2010-08-01
2010-11-01, 2010-12-01

It’s like magic!

-Chris

Gowtham_Rao · June 8, 2018, 2:25pm

@hripcsa

I have spent an entire weekend learning and admiring this code snippet from @chris_knoll . It is complex, elegant and efficient all at the same time. It’s truly amazing. It deserves to be called magic.

mgkahn · June 8, 2018, 5:56pm

I totally agree with @Gowtham_Rao!! A summer student and I were head-scratching on EXACTLY that part of the cohort definition code for the treatment pathways project. I NEVER, EVER, EVER, EVER, EVER would have figured it out w/o @Chris_Knoll’s posting.

More seriously, Chris – did you figure this out yourself or does this come from some Hogwarts School of Witchcraft and Wizardry book…

hripcsa · June 8, 2018, 6:50pm

Awesome! Thanks so much for the tutorial and very neat.

Chris_Knoll · June 8, 2018, 10:39pm

Yes, I implemented the same sort of strategy in the treatment pathways algorithm. When you need to transform a pattern of A-B-B-B-C-B-B-C-C-C-D into A-B-C-B-C-D, this technique can be applied.

The approach was derived from some research I did on solving the problem of finding the end of eras from this content:

-- Adapted from calculating concurrent sessions query found at http://sqlmag.com/t-sql/calculating-concurrent-sessions-part-3
-- Credits to Ben Flanaghan, Arnold Fribble, and R. Barry Young

They were trying to solve the problem of 'how many sessions were concurrently active. I flipped it around to say when are there no sessions active. I think I improved on the work, however, but I’ll let the other readers bet he judge.

I’ve attributed that work in my gist found here:

gist.github.com

https://gist.github.com/chrisknoll/8d3c6744bae4f060aec1

drug_exposure_eras_demo_NoUnboundPreceeding

-- Adapted from calculating concurrent sessions query found at http://sqlmag.com/t-sql/calculating-concurrent-sessions-part-3
-- Credits to Ben Flanaghan, Arnold Fribble, and R. Barry Young

-- DROP TABLE #DRUG_EXPOSURE;

CREATE TABLE #DRUG_EXPOSURE
(
	[DRUG_EXPOSURE_ID] [bigint] NOT NULL,
	[PERSON_ID] [bigint] NOT NULL,
	[DRUG_CONCEPT_ID] [int] NOT NULL,

This file has been truncated. show original

-Chris

Chris_Knoll · June 8, 2018, 10:53pm

I guess I shouldn’t leave that treatment pathway statment hanging, so let me demonstrate how that works:

Given the pattern:
A-B-B-B-C-B-B-C-C-C-D
and you need to remove repeats, and reduce it to:
A-B-C-B-C-D

Step 1: rownumber the events:

element, rn
A, 1
B, 2
B, 3
B, 4
C, 5
B, 6
B, 7
C, 8
C, 9
C, 10
D, 11

Next, add a rownumber, but partition by element to the result

element, rn, p_rn
A, 1, 1
B, 2, 1
B, 3, 2
B, 4, 3
C, 5, 1
B, 6, 4
B, 7, 5
C, 8, 2
C, 9, 3
C, 10. 4
D, 11, 1

Notice that duplicates get grouped when you subtract rownum - partitionrow:

element, rn, p_rn, rn-p_rn
A, 1, 1, 0
B, 2, 1, 1
B, 3, 2, 1
B, 4, 3, 1
C, 5, 1, 4
B, 6, 4, 2
B, 7, 5, 2
C, 8, 2, 6
C, 9, 3, 6
C, 10. 4, 6
D, 11, 1, 10

Now if you select element, min(rn) FROM … group by element, rn-p_rn you get:

A, 1
B, 2
C, 5
B, 6
C, 8
D, 11

Ordering by the initial RN, you get
A-B-C-B-C-D

-Chris

MarkSamuelTuttle · June 8, 2018, 11:02pm

Chris, your implementation here is an application of a sound software engineering principle - build your solution into data, rather than algorithm. Also, it’s a wonderful example of “relational” programming. My hat is off. Note that application of this principle does not prevent hiding the whole thing in algorithm later, as you imply.

Gowtham_Rao · June 8, 2018, 11:07pm

Amazing!

ericaVoss · June 27, 2018, 7:14pm

@Chris_Knoll

Question, since this “Cohort Collapse Strategy” is in ATLAS, on the implementation screen it looks like it is always trying to apply it with a default gap day of 0.

But then the Print Friendly says:

If it is set to 0 is it selecting the end of the OBSERVATION_PERIOD_END_DATE or still attempting to collapse with a gap size = 0?

Chris_Knoll · June 27, 2018, 7:26pm

Once the individual cohort episodes are identified (starting with initial events, passing through inclusion criteria, and finally applying exit strategy), you may be left with cohort episodes that overlap. Since overlapping episodes is forbidden, we always collapse the individual cohort episodes into non-overlapping episodes. By default, we allow 0 days of gap between cohort episodes. You can adjust it using the ‘collapse gap size’.

Eldar · June 27, 2018, 7:32pm

I’m not certain. But it looks like the logic is the next:
Firstly you specify Cohort Exit Criteria where cohort_end_date is defined
And only then the logic of collapsing rows is applied.
Thus ‘Cohort Collapse strategy’ doesn’t touch cohort_end_dates itself.
I.e: you can have 2 or more records for subject with different start_dates (as well as end_dates).
If you decided to collapse them - then these 2 rows will transform into 1 with min(cohort_start) and max(cohort_end)
If no - they will both stay as they are.

Chris_Knoll · June 27, 2018, 7:32pm

The collapse strategy language may be missing from the print friendly, but your editor screen capture is about the collapse strategy and your print friendly screen capture is referencing the end date strategy.

End Date Strategy is use to specify when, after the initial event passes inclusion criteria, how long you should consider the event’s episode before the episode ends. By default, if you don’t specify how an initial event should end it’s episode, then the episode will be defined as from the event start to the end of the event’s observation period.

Chris_Knoll · June 27, 2018, 7:35pm

This is correct, the collapse strategy won’t generate any new end dates. But, if there are multiple episodes that overlap, then you can consider the ‘collapsed episode’ to be the MIN(start_date) and MAX(end_date) of all episodes that are overlapping each other.

@Eldar: I just wanted to add: you don’t really have the choice to collapse them. We always do because we must ensure that there are no overlapping episodes in the cohort results. You can decide to change the ‘allowable gap’ between episodes to bring separate episodes together within a gap, but the logic must always remove overlapping episodes.

Eldar · June 27, 2018, 7:36pm

So now I’m not uncertain.
Thank you

ericaVoss · June 27, 2018, 7:51pm

Thanks @Chris_Knoll - if I read more carefully I would have seen I was messing up “Cohort Collapse Strategy” and “End Date Strategy”. If my eye traveled down only 100 more pixels I would have seen:

ericaVoss · June 27, 2018, 7:56pm

@Chris_Knoll could you give me an example of where I would use this feature. I’m sorry, I know you told me before and now I’m blanking. I was thinking COHORT_END_DATE and that is obviously not right.

Chris_Knoll · June 27, 2018, 9:16pm

if you wanted to identify episodes of backpain as a moment that you are diagnosed with backpain, and you assume that if it continues, the’ll have a return visit within 30d with backpain. So you would define your intial events as ‘condition occurrence of ‘Backpain’’, with an exit streategy of ‘fixed date of 30d after event date’. This is just saying ‘Backpain lasts 30 days’. Now if they come in < 30 days for another visit with backpain, you’ll extend the backpain episode another 30d.

What if they come in on day 32? It would be a new episode with the default 0d gap. If you set the collapse gap to 14d then backpain episodes within 14d of each other will collapse into a single back pain episode starting with the first overlapping episode and ending at the end of the last episode (30d after the back pain diagnosis).

ericaVoss · June 27, 2018, 9:23pm

Once I have that collapsed time of back pain how/where can I use it in ATLAS?