OHDSI Home | Forums | Wiki | Github

Drug Era script took too long to run

Hello

We let the drug era script run for over 2 days, but it still didn’t finish. We have CDM v5.4, the drug_exposure table has around 214 million records, all the table index were built. Is there anything we missed? What else can we do to improve the performance?

Thank you for your help.
Jack

I saw in your other thread the era-builder query, and while it does work, it’s based on an older form where we wanted to support platforms where there aren’t window functions. To overcome that limitation, we result in a self-join which could lead to performance problems.

Can you try this form:

with cteEndDates (person_id, condition_concept_id, end_date) AS
(
	SELECT
		person_id,
		condition_concept_id,
		DATEADD(day,-30, event_date) as end_date
	FROM
	(
		SELECT person_id,
			condition_concept_id,
			event_date,
			SUM(event_type) OVER (PARTITION BY person_id, condition_concept_id ORDER BY event_date ROWS UNBOUNDED PRECEDING) AS interval_status
		FROM
		(
			SELECT
				person_id,
				condition_concept_id,
				condition_start_date AS event_date,
				-1 AS event_type
			FROM @TARGET_CDMV5_SCHEMA.condition_occurrence

			UNION ALL

			SELECT
				person_id,
				condition_concept_id,
				DATEADD(day, 30, COALESCE(condition_end_date, DATEADD(day, 1, condition_start_date))) AS event_date,
				1 AS event_type
			FROM @TARGET_CDMV5_SCHEMA.condition_occurrence
		) RAWDATA
	) e
	WHERE interval_status = 0
),
cteEnds (person_id, condition_concept_id, condition_start_date, condition_end_date) AS
(
	SELECT co.person_id,
		co.condition_concept_id,
		co.condition_start_date as start_date,
		MIN(e.end_date) AS condition_end_date
	FROM @TARGET_CDMV5_SCHEMA.condition_occurrence co
	INNER JOIN cteEndDates e ON co.person_id = e.person_id
		AND co.condition_concept_id = e.condition_concept_id
		AND e.end_date >= co.condition_start_date
	GROUP BY  co.person_id,
		co.condition_concept_id,
		co.condition_start_date
)
select top 10 person_id, condition_concept_id, min(condition_start_date) as condition_start_date, condition_end_date, count(*) as condition_occurrence_count
from cteEnds
group by person_id, condition_concept_id, condition_end_date
;

Some notes on this:

  • I followed the original query for the variable name of the CDM schema: @TARGET_CDMV5_SCHEMA
  • It’s not clear what database platform you are running on, but it might be more performant to split thsi query into separate temp tables. If you could provide some of your DBMS details, that would be helpful.
  • The output of this query just selects 10 rows, but it had to process the entire dataset to produce those rows, so you can use this form to test performance, but in the end you’ll have to modify it to write the results into your target table.

It’s hard to give more guidance without knowing what database you are running this on. I tested this in my local env on Redshift with 7,020,610,394 rows in the condition occurrence table and it finished in about 15 minutes. We have a beefy configuration tho, so don’t expect to get the same results. For your own environment, you might want to tune the available memory, since the era building logic needs to scan the entire table and perform a large sort operation, so other than indexes that would apply a sort order on the table by person-condtiion_concept_id-start_date, there’s not much more you can do to speed it up.

@Chris_Knoll

Our OMOP database is hosted on SQL Server. The script you posted is for Condition Era. The condition_occurrence table in our OMOP has 240,770,047 rows, the condition era script can be completed in a reasonable amount of time. It is the drug era script that we are struggle with. Do you have revised drug era script?

Thanks
Jack

Sure, you can use this:

with cteDrugTarget (DRUG_EXPOSURE_ID, PERSON_ID, DRUG_CONCEPT_ID, DRUG_EXPOSURE_START_DATE, DRUG_EXPOSURE_END_DATE) as
(
-- Normalize DRUG_EXPOSURE_END_DATE to either the existing drug exposure end date, or add days supply, or add 1 day to the start date
	select DRUG_EXPOSURE_ID, PERSON_ID, DRUG_CONCEPT_ID, DRUG_EXPOSURE_START_DATE, 
		COALESCE(DRUG_EXPOSURE_END_DATE, DATEADD(day,DAYS_SUPPLY,DRUG_EXPOSURE_START_DATE), DATEADD(day,1,DRUG_EXPOSURE_START_DATE)) as DRUG_EXPOSURE_END_DATE
	FROM @TARGET_CDMV5_SCHEMA.DRUG_EXPOSURE de
		join @TARGET_CDMV5_SCHEMA.CONCEPT_ANCESTOR ca on ca.DESCENDANT_CONCEPT_ID = de.DRUG_CONCEPT_ID
		join @TARGET_CDMV5_SCHEMA.CONCEPT c on ca.ANCESTOR_CONCEPT_ID = c.CONCEPT_ID
		where c.VOCABULARY_ID = 'RxNorm'
		and c.CONCEPT_CLASS_ID = 'Ingredient'
),
cteEndDates (person_id, drug_concept_id, end_date) AS
(
	SELECT
		person_id,
		drug_concept_id,
		DATEADD(day,-30, event_date) as end_date
	FROM
	(
		SELECT person_id,
			drug_concept_id,
			event_date,
			SUM(event_type) OVER (PARTITION BY person_id, drug_concept_id ORDER BY event_date ROWS UNBOUNDED PRECEDING) AS interval_status
		FROM
		(
			SELECT
				person_id,
				drug_concept_id,
				drug_exposure_start_date AS event_date,
				-1 AS event_type
			FROM cteDrugTarget

			UNION ALL

			SELECT
				person_id,
				drug_concept_id,
				DATEADD(day, 30, drug_exposure_end_date) AS event_date,
				1 AS event_type
			FROM cteDrugTarget
		) RAWDATA
	) e
	WHERE interval_status = 0
),
cteEnds (person_id, drug_exposure_id, drug_concept_id, drug_exposure_start_date, drug_exposure_end_date) AS
(
	SELECT de.person_id,
	  de.drug_exposure_id,
		de.drug_concept_id,
		de.drug_exposure_start_date as start_date,
		MIN(e.end_date) AS drug_exposure_end_date
	FROM cteDrugTarget de
	INNER JOIN cteEndDates e ON de.person_id = e.person_id
		AND de.drug_concept_id = e.drug_concept_id
		AND e.end_date >= de.drug_exposure_start_date
	GROUP BY de.person_id,
	  de.drug_exposure_id,
		de.drug_concept_id,
		de.drug_exposure_start_date
)
select top 10 person_id, drug_concept_id, min(drug_exposure_start_date) as drug_exposure_start_date, drug_exposure_end_date, count(*) as drug_exposure_connt
from cteEnds
group by person_id, drug_concept_id, drug_exposure_end_date
;

Same notes as above, you’ll modify this to insert the final results into your target drug_era table.

This ran in 8.5 minutes on a 5,000,028,361 record table, only selecting the first 10 rows.

Hi, I wanted to add a note related to the drug-era building vs. condition:

In the case of drug-era, we need to make a vocabulary lookup to map the drug_expsoure drug_concept_id into an ingredient-class concept. So, unlike condition, this join to the vocabulary tables could be greatly impacted by your indexes. Therefore, in the case of drug era, you should check that you have indexes on concept and concept_ancestor so that the lookups from drug_exposure back to the ingredient concept can be as fast as possible.

-Chris

@Chris_Knoll

Thank you so much, Chris. We are revising the scripts you provided and will give it a try soon. We will make sure the indexes on concepts are in place.

Jack

Any luck here?

@Chris_Knoll

We are able to run through the new scripts you provided in hours with minor tweaking, which loaded those 3 CTEs to staging tables first. THANK YOU!

One question is that the calculation of gap days was not available in the new script. Is it left out intentionally?

Jack

Right, gap days. I need to locate the query that calculates that, I remember seeing that contributed from the community, but I was just focused on the era-building logic in my response. let me see what I can find.

Ok, so my quck-and-dirty answer (which you’ll have to translate the below to your own query):

Gap days is basically the duration of the calculated era - durations of individual exposures found in the era.group. So, the gap days can be caluclated by adding this to your final select:

datediff(d, min(drug_exposure_start_date), drug_exposure_end_date) - SUM(days_exposed)

The trick is that we need to calculate the days_exposed from the original set of records from cteDrugTarget, so change it:

	select DRUG_EXPOSURE_ID, PERSON_ID, DRUG_CONCEPT_ID, DRUG_EXPOSURE_START_DATE, 
		COALESCE(DRUG_EXPOSURE_END_DATE, DATEADD(day,DAYS_SUPPLY,DRUG_EXPOSURE_START_DATE), DATEADD(day,1,DRUG_EXPOSURE_START_DATE)) as DRUG_EXPOSURE_END_DATE,
DATEDIFF(day, DRUG_EXPOSURE_START_DATE, COALESCE(DRUG_EXPOSURE_END_DATE, DATEADD(day,DAYS_SUPPLY,DRUG_EXPOSURE_START_DATE), DATEADD(day,1,DRUG_EXPOSURE_START_DATE)) as days_exposed
	FROM @TARGET_CDMV5_SCHEMA.DRUG_EXPOSURE de
		join @TARGET_CDMV5_SCHEMA.CONCEPT_ANCESTOR ca on ca.DESCENDANT_CONCEPT_ID = de.DRUG_CONCEPT_ID
		join @TARGET_CDMV5_SCHEMA.CONCEPT c on ca.ANCESTOR_CONCEPT_ID = c.CONCEPT_ID
		where c.VOCABULARY_ID = 'RxNorm'
		and c.CONCEPT_CLASS_ID = 'Ingredient'

You need to include the ‘days_exposed’ in the cteEnds so that you can access it in the final select.

Below is the complete solution:

with cteDrugTarget (DRUG_EXPOSURE_ID, PERSON_ID, DRUG_CONCEPT_ID, DRUG_EXPOSURE_START_DATE, DRUG_EXPOSURE_END_DATE) as
(
-- Normalize DRUG_EXPOSURE_END_DATE to either the existing drug exposure end date, or add days supply, or add 1 day to the start date
	select DRUG_EXPOSURE_ID, PERSON_ID, DRUG_CONCEPT_ID, DRUG_EXPOSURE_START_DATE, 
		COALESCE(DRUG_EXPOSURE_END_DATE, DATEADD(day,DAYS_SUPPLY,DRUG_EXPOSURE_START_DATE), DATEADD(day,1,DRUG_EXPOSURE_START_DATE)) as DRUG_EXPOSURE_END_DATE,
		DATEDIFF(day, DRUG_EXPOSURE_START_DATE, COALESCE(DRUG_EXPOSURE_END_DATE, DATEADD(day,DAYS_SUPPLY,DRUG_EXPOSURE_START_DATE), DATEADD(day,1,DRUG_EXPOSURE_START_DATE)) as DAYS_EXPOSED
	FROM @TARGET_CDMV5_SCHEMA.DRUG_EXPOSURE de
		join @TARGET_CDMV5_SCHEMA.CONCEPT_ANCESTOR ca on ca.DESCENDANT_CONCEPT_ID = de.DRUG_CONCEPT_ID
		join @TARGET_CDMV5_SCHEMA.CONCEPT c on ca.ANCESTOR_CONCEPT_ID = c.CONCEPT_ID
		where c.VOCABULARY_ID = 'RxNorm'
		and c.CONCEPT_CLASS_ID = 'Ingredient'
),
cteEndDates (person_id, drug_concept_id, end_date) AS
(
	SELECT
		person_id,
		drug_concept_id,
		DATEADD(day,-30, event_date) as end_date
	FROM
	(
		SELECT person_id,
			drug_concept_id,
			event_date,
			SUM(event_type) OVER (PARTITION BY person_id, drug_concept_id ORDER BY event_date ROWS UNBOUNDED PRECEDING) AS interval_status
		FROM
		(
			SELECT
				person_id,
				drug_concept_id,
				drug_exposure_start_date AS event_date,
				-1 AS event_type
			FROM cteDrugTarget

			UNION ALL

			SELECT
				person_id,
				drug_concept_id,
				DATEADD(day, 30, drug_exposure_end_date) AS event_date,
				1 AS event_type
			FROM cteDrugTarget
		) RAWDATA
	) e
	WHERE interval_status = 0
),
cteEnds (person_id, drug_exposure_id, drug_concept_id, drug_exposure_start_date, drug_exposure_end_date, days_exposed) AS
(
	SELECT de.person_id,
	  de.drug_exposure_id,
	  de.drug_concept_id,
	  de.drug_exposure_start_date as start_date,
	  MIN(e.end_date) AS drug_exposure_end_date,
	  de.days_exposed
	FROM cteDrugTarget de
	INNER JOIN cteEndDates e ON de.person_id = e.person_id
		AND de.drug_concept_id = e.drug_concept_id
		AND e.end_date >= de.drug_exposure_start_date
	GROUP BY de.person_id,
	  de.drug_exposure_id,
	  de.drug_concept_id,
	  de.drug_exposure_start_date
	  de.days_exposed
)
select top 10 person_id, drug_concept_id, min(drug_exposure_start_date) as drug_exposure_start_date, drug_exposure_end_date,
 count(*) as drug_exposure_connt,
 datediff(day, min(drug_exposure_start_date), drug_exposure_end_date) - SUM(days_exposed)  as gap_days
from cteEnds
group by person_id, drug_concept_id, drug_exposure_end_date
;

I’m unable to test this on my local env this time, so you may experience errors.

Also note: gap_days could be a negative value when you have a lot of drug_exposure overlaps (causing the sum of exposure days within an era to exceed the actual duration of the era). This is referred to as ‘stockpiling’ and it’s not something that is handled in this logic.

@Chris_Knoll

We were able to run through the new script to include calculation of gap_days with minor revisions. Thank you so much for your help.

Jack

t