OHDSI Home | Forums | Wiki | Github

Making sense of date offsets in generated SQL

webapi
cdm

(Clark C. Evans) #1

Hello. I’m trying to grok the SQL generated in the Atlas interface, and in particular, how date offsets for period inclusion is done for various kinds of queries.

So, I figure the answer to the 1st is that the logic is to be “exclusive” logic, where the end_date is often the day after, as hinted at by this 2015 thread. However, I don’t see the corresponding logic I’d expect, where one would use < or a -1 adjustment in the filter, rather than an inclusive logic, <=.

In the second question, perhaps the answer has something to do with only wanting to track when the drug was prescribed rather than the whole window in which it was delivered? In this case though, why let the end_date be used since it could have been filled in via ETL as including the entire prescription period. In either case, it seems that there could be inconsistency as to how the query is interpreted depending upon how the data is loaded.

Generally, I’m still looking for formal documentation that describes exactly how dates are treated and corresponding regression test cases. Would someone who is familiar with the SQL generation please enlighten me? Thanks!


(Chris Knoll) #2

Hello, @cce, I can answer questions related to CIRCE expressions.

In both bullet points, you’re referring to how the event duration is calculated. If there is no end date, then it uses start_date + 1 (which would create the interval that spans the entire day). When dealing with the time between, it includes the start date, but goes up to, but does not include, the end date. I think it is a bug that the drug criteria doesn’t attempt to use the days supply do infer an end date (it goes end_date, else start_date + 1). The code you saw that was applying the days_supply was when it was building a custom drug era. But, to be sure, both cases should be applying the days_supply in the event there is no end date. Personally, I wish that the THEMIS convention was that the ETL must guarantee an end date in the record, and I would always just use that. But, to be more robust, it should try end_date -> days_supply -> start+1.

I’ve been giving some thought on how the ‘window criteria’ dates are handled, and also with an eye of creating test cases so that we can certify the correct behavior. In thiink the current implementation may have some minor problems, and this thread is a great place to discuss it.

So, let me give you the logic and reasoning behind how date offsets are used:

When we look for correlated criteria using a window (ie: having the event start between 90d before and 0 days before index or from 0 days after to 5 days after index). we’re simply adding/subtracting a number of days from the index event, and then looking for events that start between the start date and end date, based on the offset. Example:
index event is 1/1/2010 and you want to find an event that started between 0 days after and 7 days after. The dates we will use are 1/1/2010 + 0d (1/1/2010) and 1/1/2010 + 7d (1/8/2010). The query should look for the events as >= 1/1/2010 and < 1/8/2010 (because that will cover all time up to but not including 1/8/2010, for a grand total of 7 days).

If you look at the code, it doesn’t quite work that way (which I think there is an issue). The query will generate the following join condition:

AND A.START_DATE >= DATEADD(day,0,P.START_DATE) AND A.START_DATE <= DATEADD(day,7,P.START_DATE)

You can see the problem is that it uses <= dateadd(day,7,p.start_date) as the upper bound, which will capture 1/8/2001. It should capture events up to but not including 1/8/2001. So, that’s a bug. But, when dealing with soemthing like ‘between 7 days before and 0 days after’, this logic (using <=) will capture the index date properly because you’ll get an expression:

AND A.START_DATE >= DATEADD(day,0,P.START_DATE) AND A.START_DATE <= DATEADD(day,0,P.START_DATE)

which will properly capture the day of index. while if we used < dateadd(day,DATEADD(day,0,P.START_DATE), it wouldn’t match anything because there is no values that can be found where x >= 1/1/2001 and x < 1/1/2001.

However, while this captures 1/1/2001 exactly, if the data had a hour component like 1/1/2001 12:15pm, this event would not be captured, (because 12:15pm is not contained with a timewindow between 1/1/2001 (which infers 12:00am) and 1/1/2001 (again, inferring 12:00am). This wasn’t a problem when the CDM didn’t capture hours/minutes, but it does turn into a problem when it does.

So, with all that in mind, let’s talk about how I’d like to see the mechanics of the window criteria work:

I’d like to be able to state the window of time as from x days before to y days after, where x days before means index - x days and y days after means index + y days.

I think this makes it easier to understand: i want to say the time period from now to 1 week later (7 days) and ‘now’ is 1/1/2010, then the timespan covers all time starting at 1/1/2010 and ends at 1/8/2010 (but does not include 1/8/2010. That’s why I think the current bug is that our upper bound after adding the offset is using <= instead of <.

Unfortunately, the behavior we’ve been teaching people is that if they just want to include the same day in their index, we’ve been telling them to do things like 7 days before and 0 days after index. If we make the upper bound behave as ‘exclusive’, then the proper way to indicate to include the index is 7 days before and 1 day after index. (such that if the index is 1/1/2010, then 1 day after is 1/2/2010, and the interval will end with < 1/2/2010, which includes all time in 1/1/2010).

So, part of my hesitation with dealing with this issue is the current pattern of behavior that people have with the tool. But I think it is much more important that people get the right results. So, let’s discuss the specific issue of defining those window of dates and how that should translate into SQL logic, and if we can all come to a consensus about how it should operate, I can adjust the code accordingly and create tests to ensure that behavior is performed.


(Clark C. Evans) #3

Chris, Thank you so much for responding. I’ve got another question for you about the generated SQL.

  • For generating the cohort_exit the strategy ends logic might to want to use a maximum between the end_date and the start_date plus 7, but the expression is a noop instead. This might be just fine, depending upon intent.

For background, I’m replicating cohort queries using DataKnots.jl. In particular, I’ve just now completed the logic for cohort 1770674 - Acute myocardial infarction events. This logic works against a very minimalist PostgreSQL SynPuf I’ve made, SynPuf-10P that has dramatically pruned vocabulary necessary to only implement logic specified in the BookOfOHDSI and hence I view it as “fair use” of SNOMED, etc. I’ve used Atlas to capture the textual, json and sql variants of various cohorts defined during the F2F meeting, and recorded them in this same repository.

So far, only 1770674 is converted into a DataKnot query. This relies only upon a copybook that defines the very beginning of a Domain Specific Query Language (DSQL) specific to OHDSI. As I build a DSQL for OHDSI, and translate cohort queries in the BookOfOHDSI, it’d be nice to know what the official logic is so that I could implement it as part of a library, OHDSI.jl which would be used by Julia people to construct cohorts, replicating functionality found in WebAPI and other modules. As you can see from the cohort query definition, it’s relatively succinct. At this time, the interval inclusion/during logic is defined inclusively, as shown in the SQL. I actually think it’s good logic. I’m not sure if one would even want to move over to exclusive interval testing… it has it’s own downsides and can be even more confusing with 1 day events.

There is much to be said about requiring this ambiguity to be resolved at ETL time. During ETL is when you know what upstream data source interpretation means, and it might mean different things depending upon where each kind of event is captured. Hence, having simple inclusive logic and making end_date mandatory may be smart.


(Chris Knoll) #4

Hi, @cce
I very much appreciate people looking into this! It is very educational.

For the cohort_exit strategy, there might be an issue there. The idea of that type of strategy is to just set the cohort exit based on an offset. The logic:
case when (start_date + 7*INTERVAL'1 day') > start_date then (start_date + 7*INTERVAL'1 day') else start_date end as end_date
The logic of this is to just ensure that the offset occurs after the start date, however the else case should be start_date + 1 day (so that we don’t have 0 length cohort eras).It’s not the max of the end_date and start_date +7 because we’re ignoring the event’s end_date in this case: we simply want to assign the cohort end to be an offset from the event start_date. if that pushes it past the event’s end date, that is OK. later on it is supposed to take the earlist of the strategy end date vs the observation_period_end_date of the event (because you don’t let the date offset push past the observation period).

Again, i think you’ve identified a bug: this is the query that is supopsed to bring all the end dates for each event:

WITH cohort_ends (event_id, person_id, end_date) AS ( -- cohort exit dates -- End Date Strategy SELECT event_id, person_id, end_date from strategy_ends
This CTE should have contained the derived end date from the strategy unioned with the observation period end date of each event. For some reason this is missing, and I could have sworn it was there at one point. I will open an issue on this and correct it.

In fact, i just checked git, and there is an open issue for this here.

I’m not familiar with DataKnot or any efforts to create a “domain specific query language”, so I’ll leave that to the experts, but I appreciate you reviewing the queries ans asking questions about the logic.

I’ll open an issue related to handling the time intervals in that repo so that we can get some community feedback on how that’s handled.

-Chris


(Clark C. Evans) #5

@Chris_Knoll I’m just poking, nothing more. Anyway, we could probably use the very pruned database I’ve made as the basis of a regression test (unless there’s a better one). The repository has CSV files for each table (CDMv5). Since the vocabulary is only enough to support the use-cases I’m testing with, replication of cohort queries in the Book of OHDSI, it is is a very minimalistic subset of the vocabulary. Hence, I believe it falls under fair use to be publicly disclosed.

So, it’d be great if we could come up with a shared definition of what these cohort queries are, and how they should be interpreted. If this sample database (or some other sample) could be input, we could list expected outputs, etc. Furthermore, there is no reason why we couldn’t add dummy data to catch edge cases that are interesting.

As for DataKnots, it’s my own initiative. I just see OHDSI as a lovely case since this community has complex queries and having a more formal way to specify the queries may be advantageous. Currently DataKnots is quite slow when it comes to SQL processing (it lacks any sort of SQL push-down), but that’ll happen soon enough. Right now the queries work, which is good enough for me to validate semantics.

P.S. OK. I looked at that bug, and fixed the corresponding version of the cohort query, here is the commit.

I would also look into how the nomenclature used could be brought into alignment with HL7’s Clinical Quality Language, in particular timing phrases and interval operators and interval calculations.


(Chris Knoll) #6

Poking makes the code better! Keep it up!

I’ve opened the git issue discussing the criteria windows here. I’ll make the ‘date offset strategy’ fix a priority. Thanks very much for your review.

-Chris


t