Hey, @areckard,
I suspect it has to do with the HIVE implementation not including the current row in the ROWS UNBOUNDED PRECEEDING)…but there’s some tests you can do:
I wrote this and executed it in MSSQL to see how the max() over () produces the result, here’s the test query:
with cteConditionTarget (condition_occurrence_id, person_id, condition_concept_id, condition_start_date, condition_end_date) as
(
select 1 as condition_occurrence_id, 29335874 as person_id, 26662 as condition_concept_id, '2013-11-27' as condition_start_date, '2013-11-27' as condition_end_date
UNION ALL
select 2 as condition_occurrence_id, 29335874 as person_id, 26662 as condition_concept_id, '2013-11-27' as condition_start_date, '2013-11-27' as condition_end_date
)
select PERSON_ID, CONDITION_CONCEPT_ID, EVENT_DATE, EVENT_TYPE,
START_ORDINAL as RAW_START_ORDINAL,
MAX(START_ORDINAL) OVER (PARTITION BY PERSON_ID, CONDITION_CONCEPT_ID ORDER BY EVENT_DATE, EVENT_TYPE ROWS UNBOUNDED PRECEDING) as START_ORDINAL, -- this pulls the current START down from the prior rows so that the NULLs from the END DATES will contain a value we can compare with
ROW_NUMBER() OVER (PARTITION BY PERSON_ID, CONDITION_CONCEPT_ID ORDER BY EVENT_DATE, EVENT_TYPE) AS OVERALL_ORD -- this re-numbers the inner UNION so all rows are numbered ordered by the event date
from
(
-- select the start dates, assigning a row number to each
Select PERSON_ID, CONDITION_CONCEPT_ID, CONDITION_START_DATE AS EVENT_DATE, -1 as EVENT_TYPE, ROW_NUMBER() OVER (PARTITION BY PERSON_ID, CONDITION_CONCEPT_ID ORDER BY CONDITION_START_DATE) as START_ORDINAL
from cteConditionTarget
UNION ALL
-- pad the end dates by 30 to allow a grace period for overlapping ranges.
select PERSON_ID, CONDITION_CONCEPT_ID, DATEADD(day,30,CONDITION_END_DATE), 1 as EVENT_TYPE, NULL
FROM cteConditionTarget
) RAWDATA
The results look like this:
PERSON_ID CONDITION_CONCEPT_ID EVENT_DATE EVENT_TYPE RAW_START_ORDINAL START_ORDINAL OVERALL_ORD
29335874 26662 2013-11-27 00:00:00.000 -1 1 1 1
29335874 26662 2013-11-27 00:00:00.000 -1 2 2 2
29335874 26662 2013-12-27 00:00:00.000 1 NULL 2 3
29335874 26662 2013-12-27 00:00:00.000 1 NULL 2 4
What should happen in the query is that the MAX() over() should go through each row and set the MAX_START_ORDINAL column to the max value of the start_ordinal value using the prior rows, including the current. As you can see from the ordering of the output above, where the RAW_START_ORDINAL is 1, the MAX(START_ORDINAL) should result in 1 because there’s only the 1 row to work with.
It looks like HIVE is using the entire result set worth of rows, and finding the max() start_ordinal is 2 and returning that.
There is an alternative to using MAX() over () ROWS UNBOUND PRECEDING in cases where it doesn’t work (we had to do this in our internal environment) found at this gist: https://gist.github.com/chrisknoll/8d3c6744bae4f060aec1
Replace the cteEndDates with this query:
cteEndDates (PERSON_ID, END_DATE) as -- the magic
(
select PERSON_ID, DATEADD(day,-30,EVENT_DATE) as END_DATE -- unpad the end date
FROM
(
select E1.PERSON_ID, E1.CONDITION_CONCEPT_ID, E1.EVENT_DATE, COALESCE(E1.START_ORDINAL,MAX(E2.START_ORDINAL)) START_ORDINAL, E1.OVERALL_ORD
FROM
(
select PERSON_ID, CONDITION_CONCEPT_ID, EVENT_DATE, EVENT_TYPE,
START_ORDINAL,
ROW_NUMBER() OVER (PARTITION BY PERSON_ID, CONDITION_CONCEPT_ID ORDER BY EVENT_DATE, EVENT_TYPE) AS OVERALL_ORD -- this re-numbers the inner UNION so all rows are numbered ordered by the event date
from
(
-- select the start dates, assigning a row number to each
Select PERSON_ID, CONDITION_CONCEPT_ID, CONDITION_OCCURRENCE_START_DATE AS EVENT_DATE, 0 as EVENT_TYPE, ROW_NUMBER() OVER (PARTITION BY PERSON_ID ORDER BY CONDITION_OCCURRENCE_START_DATE) as START_ORDINAL
from cteConditionTarget
UNION ALL
-- add the end dates with NULL as the row number, padding the end dates by 30 to allow a grace period for overlapping ranges.
select PERSON_ID, CONDITION_CONCEPT_ID, DATEADD(day,30,CONDITION_OCCURRENCE_END_DATE), 1 as EVENT_TYPE, NULL
FROM cteConditionTarget
) RAWDATA
) E1
JOIN (
Select PERSON_ID, CONDITION_CONCEPT_ID, CONDITION_OCCURRENCE_START_DATE AS EVENT_DATE, ROW_NUMBER() OVER (PARTITION BY PERSON_ID ORDER BY CONDITION_OCCURRENCE_START_DATE) as START_ORDINAL
from cteConditionTarget
) E2 ON E1.PERSON_ID = E2.PERSON_ID AND E1.CONDITION_CONCEPT_ID = E2.CONDITION_CONCEPT_ID AND E2.EVENT_DATE <= E1.EVENT_DATE
GROUP BY E1.PERSON_ID, E1.CONDITION_CONCEPT_ID, E1.EVENT_DATE, E1.START_ORDINAL, E1.OVERALL_ORD
) E
WHERE 2 * E.START_ORDINAL - E.OVERALL_ORD = 0
)
The gist i pointed you to has a deep explanation about how the era building logic works, although it’s using data from a drug exposure sample and not partitioning by the concept_id (the sample data only has 1 concept_id in it)…so I had to make some manual edits from that gist for this post, so apologies for the typos!
The idea is to replace the MAX() over ROWS PRECEEDING with a join on the event dates where the join’d table (E2) event dates are <= the event date of the rows from the E1 table. Then using a group by clause we can get the max(start_ordinal) from the prior rows.
-Chris