Condition ERA in Hadoop - Different Results

areckard · April 4, 2017, 4:42pm

Has anyone successfully processed the Condition ERA table in Hadoop? We are having an issue with getting the same results from our ETL application when compared to ad hoc queries in Hive SQL Editor in HUE.

Just a little background; we ran into issues running the community code as is via Spark SQL. So we had to refactor the code. Because of this we decided to run this through a thorough QA cycle to make sure the refactoring didn’t break anything. As it turns out QA found many records that should have been in the table that weren’t.

After extensive research we’ve been able to isolate the issue to the window function for the “start_ordinal” value.

cteEndDates AS – the magic
(
SELECT
person_id AS person_id,
condition_concept_id AS condition_concept_id,
to_date(date_add(event_date, -30)) AS end_date – unpad the end date
FROM
(
SELECT
person_id AS person_id,
condition_concept_id AS condition_concept_id,
event_date AS event_date,
event_type AS event_type,
MAX(start_ordinal) OVER (PARTITION BY person_id, condition_concept_id
ORDER BY event_date, event_type
ROWS UNBOUNDED PRECEDING) AS start_ordinal,

Here are results side by side…

Because the “start_ordinal” value is different when processed via Spark we are always off when it comes to the QA review.

Has anyone encountered this before?

Thanks for your time and help.

Anthony

Chris_Knoll · April 4, 2017, 11:12pm

Hey, @areckard,
I suspect it has to do with the HIVE implementation not including the current row in the ROWS UNBOUNDED PRECEEDING)…but there’s some tests you can do:

I wrote this and executed it in MSSQL to see how the max() over () produces the result, here’s the test query:

with cteConditionTarget (condition_occurrence_id, person_id, condition_concept_id, condition_start_date, condition_end_date) as
(
	select 1 as condition_occurrence_id, 29335874 as person_id, 26662 as condition_concept_id, '2013-11-27' as condition_start_date, '2013-11-27' as condition_end_date
	UNION ALL
	select 2 as condition_occurrence_id, 29335874 as person_id, 26662 as condition_concept_id, '2013-11-27' as condition_start_date, '2013-11-27' as condition_end_date
)
select PERSON_ID, CONDITION_CONCEPT_ID, EVENT_DATE, EVENT_TYPE, 
START_ORDINAL as RAW_START_ORDINAL,
MAX(START_ORDINAL) OVER (PARTITION BY PERSON_ID, CONDITION_CONCEPT_ID ORDER BY EVENT_DATE, EVENT_TYPE ROWS UNBOUNDED PRECEDING) as START_ORDINAL, -- this pulls the current START down from the prior rows so that the NULLs from the END DATES will contain a value we can compare with 
ROW_NUMBER() OVER (PARTITION BY PERSON_ID, CONDITION_CONCEPT_ID ORDER BY EVENT_DATE, EVENT_TYPE) AS OVERALL_ORD -- this re-numbers the inner UNION so all rows are numbered ordered by the event date
from
(
	-- select the start dates, assigning a row number to each
	Select PERSON_ID, CONDITION_CONCEPT_ID, CONDITION_START_DATE AS EVENT_DATE, -1 as EVENT_TYPE, ROW_NUMBER() OVER (PARTITION BY PERSON_ID, CONDITION_CONCEPT_ID ORDER BY CONDITION_START_DATE) as START_ORDINAL
	from cteConditionTarget
		
	UNION ALL
		
	-- pad the end dates by 30 to allow a grace period for overlapping ranges.
	select PERSON_ID, CONDITION_CONCEPT_ID, DATEADD(day,30,CONDITION_END_DATE), 1 as EVENT_TYPE, NULL
	FROM cteConditionTarget
) RAWDATA

The results look like this:

PERSON_ID	CONDITION_CONCEPT_ID	EVENT_DATE	EVENT_TYPE	RAW_START_ORDINAL	START_ORDINAL	OVERALL_ORD
29335874	26662	2013-11-27 00:00:00.000	-1	1	1	1
29335874	26662	2013-11-27 00:00:00.000	-1	2	2	2
29335874	26662	2013-12-27 00:00:00.000	1	NULL	2	3
29335874	26662	2013-12-27 00:00:00.000	1	NULL	2	4

What should happen in the query is that the MAX() over() should go through each row and set the MAX_START_ORDINAL column to the max value of the start_ordinal value using the prior rows, including the current. As you can see from the ordering of the output above, where the RAW_START_ORDINAL is 1, the MAX(START_ORDINAL) should result in 1 because there’s only the 1 row to work with.

It looks like HIVE is using the entire result set worth of rows, and finding the max() start_ordinal is 2 and returning that.

There is an alternative to using MAX() over () ROWS UNBOUND PRECEDING in cases where it doesn’t work (we had to do this in our internal environment) found at this gist: https://gist.github.com/chrisknoll/8d3c6744bae4f060aec1

Replace the cteEndDates with this query:

cteEndDates (PERSON_ID, END_DATE) as -- the magic
(
	select PERSON_ID, DATEADD(day,-30,EVENT_DATE) as END_DATE -- unpad the end date
	FROM
	(
		select E1.PERSON_ID, E1.CONDITION_CONCEPT_ID, E1.EVENT_DATE, COALESCE(E1.START_ORDINAL,MAX(E2.START_ORDINAL)) START_ORDINAL, E1.OVERALL_ORD 
		FROM 
		(
			select PERSON_ID, CONDITION_CONCEPT_ID,  EVENT_DATE, EVENT_TYPE, 
			START_ORDINAL,
			ROW_NUMBER() OVER (PARTITION BY PERSON_ID, CONDITION_CONCEPT_ID  ORDER BY EVENT_DATE, EVENT_TYPE) AS OVERALL_ORD -- this re-numbers the inner UNION so all rows are numbered ordered by the event date
			from
			(
				-- select the start dates, assigning a row number to each
				Select PERSON_ID, CONDITION_CONCEPT_ID, CONDITION_OCCURRENCE_START_DATE AS EVENT_DATE, 0 as EVENT_TYPE, ROW_NUMBER() OVER (PARTITION BY PERSON_ID ORDER BY CONDITION_OCCURRENCE_START_DATE) as START_ORDINAL
				from cteConditionTarget 
		
				UNION ALL
		
				-- add the end dates with NULL as the row number, padding the end dates by 30 to allow a grace period for overlapping ranges.
				select PERSON_ID, CONDITION_CONCEPT_ID, DATEADD(day,30,CONDITION_OCCURRENCE_END_DATE), 1 as EVENT_TYPE, NULL
				FROM cteConditionTarget 
			) RAWDATA
		) E1
		JOIN (
			Select PERSON_ID, CONDITION_CONCEPT_ID, CONDITION_OCCURRENCE_START_DATE AS EVENT_DATE, ROW_NUMBER() OVER (PARTITION BY PERSON_ID ORDER BY CONDITION_OCCURRENCE_START_DATE) as START_ORDINAL
			from cteConditionTarget 
		) E2 ON E1.PERSON_ID = E2.PERSON_ID AND E1.CONDITION_CONCEPT_ID = E2.CONDITION_CONCEPT_ID AND E2.EVENT_DATE <= E1.EVENT_DATE
		GROUP BY E1.PERSON_ID, E1.CONDITION_CONCEPT_ID, E1.EVENT_DATE, E1.START_ORDINAL, E1.OVERALL_ORD
	) E
	WHERE 2 * E.START_ORDINAL - E.OVERALL_ORD = 0
)

The gist i pointed you to has a deep explanation about how the era building logic works, although it’s using data from a drug exposure sample and not partitioning by the concept_id (the sample data only has 1 concept_id in it)…so I had to make some manual edits from that gist for this post, so apologies for the typos!

The idea is to replace the MAX() over ROWS PRECEEDING with a join on the event dates where the join’d table (E2) event dates are <= the event date of the rows from the E1 table. Then using a group by clause we can get the max(start_ordinal) from the prior rows.

-Chris

areckard · April 5, 2017, 12:26am

Hi Chris,

Thanks so much! I will try this first thing tomorrow morning.

And I really appreciate the quick response. I hope to do the same for someone in the future.

Anthony