ERAs tables Memory issue

clairblacketer · January 5, 2021, 5:34pm

I am moving this question over to the forums as I think more people will see it over here than on the CDM github.

Hi!

We are working on the mapping of our database to the OMOP CDM. We have been able to run all the ETLs designed, but when executing the ERA ones (taken from sqlScripts.utf8) we run into this error:

“(74360397 rows affected)
Mens 1101, Level 17, State 10, Procedure dbo.CREATE_DRUG_ERA, Line 44
A new page could not be allocated for database ‘TEMPDB’ because filegroup ‘DEFAULT’ has insufficient disk space. Remove objects from the filegroup, add additional files to the filegroup, or set the auto-grow option for existing files in the filegroup to create the necessary space.”

Condition_occurrence and drug_exposure are populated. We have a similar problem when running condition_era. Has any of you found the same issue? If so, how to solve it?

Thanks a lot!

github user Carlogo815

anna_hammais · June 6, 2024, 7:43am

Hi!

We are having the same problem and are working on a solution. Our first idea is to perhaps modify the script to loop over smaller groups of persons at a time, so that the resulting temporary objects would not be so large. We have not tested this yet. Our particular problem concerns the contition_era table.

Does anyone have better suggestions? Any advice would be appreciated!

rookie_crewkie · June 6, 2024, 1:04pm

Hello @anna_hammais,

Assuming that you’re also using the algorithm mentioned above: you could try rewriting the largest query into several ones to use temp tables instead of subqueries. Sometimes the query planner may get tripped up by the joins ( #cteConditionTarget table in particular) and therefore request way more resources than actually required to run the query. Yes, you would use more storage space, but the data will be processed in smaller chunks and the execution plan should be better.

That being said, partitioning by person_id and processing in chunks is also a valid optimization technique, because this field is used either as a join or as an aggregate key anyway.

Chris_Knoll · June 6, 2024, 10:13pm

We’ve made some advances on era-fying drug exposure and condition occurrence records and wanted to share here the new technique that should be less pressure on temp db and other resources.

The new form as as follows:

create table #temp_condition_era
DISTKEY(person_id)
as
select person_id, condition_concept_id, min(start_date) as era_start_date, DATEADD(day,-30, max(end_date)) as era_end_date
from (
  select person_id, condition_concept_id, start_date, end_date, sum(is_start) over (partition by person_id, condition_concept_id order by start_date, is_start desc rows unbounded preceding) group_idx
  from (
    select person_id, condition_concept_id, start_date, end_date, 
      case when max(end_date) over (partition by person_id, condition_concept_id order by start_date rows between unbounded preceding and 1 preceding) >= start_date then 0 else 1 end is_start
    from (
      select person_id, condition_concept_id, condition_start_date as start_date, DATEADD(day,30,coalesce(condition_end_date, condition_start_date)) as end_date
      FROM cdm_schema.condition_occurrence
    ) DT
  ) ST
) GR
group by person_id, condition_concept_id, group_idx

Note this is in a redshift syntax where i’m taking the results of this query and putting it into a temp table and distributing on personid.

You should be able to take the above query and make your own modifications to work in your schema and RDBMS, but the main point where the records to erify comes from is in the middle subquery DT:

      select person_id, condition_concept_id, condition_start_date as start_date, DATEADD(day,30,coalesce(condition_end_date, condition_start_date)) as end_date
      FROM cdm_schema.condition_occurrence

You can select any source records that have start/end dates you want, but note in this example we’re selecting person and condition_concept_id so we need to partition by person_id, condition_concept_id in the places we are doing the over clause.

This should be much faster and more memory efficient to do it this way. Please let me know if you have any questions.

anna_hammais · June 10, 2024, 10:53am

Thank you very much @rookie_crewkie and @Chris_Knoll ! We will definitely try these.