
Achilles ExportJson issue with Postgres

We are stuck at 0% in the drug treemap generation. I left the script running over the weekend, and after 68 hours it is still executing the following SELECT statement:

select concept_hierarchy.concept_id,
    COALESCE(concept_hierarchy.atc1_concept_name,'NA') || '||' ||
    COALESCE(concept_hierarchy.atc3_concept_name,'NA') || '||' ||
    COALESCE(concept_hierarchy.atc5_concept_name,'NA') || '||' ||
    COALESCE(concept_hierarchy.rxnorm_ingredient_concept_name,'NA') || '||' ||
    concept_hierarchy.rxnorm_concept_name as concept_path,
    ar1.count_value as num_persons,
    ROUND(CAST(1.0*ar1.count_value / denom.count_value AS NUMERIC),5) as percent_persons,
    ROUND(CAST(1.0*ar2.count_value / ar1.count_value AS NUMERIC),5) as records_per_person
from (select * from ACHILLES_results where analysis_id = 700) ar1 -- persons per drug concept
    inner join
    (select * from ACHILLES_results where analysis_id = 701) ar2 -- records per drug concept
    on ar1.stratum_1 = ar2.stratum_1
    inner join
    (
        -- concept_hierarchy: RxNorm drug -> ingredient -> ATC levels, one row per drug concept
        select rxnorm.concept_id,
            rxnorm.concept_name as rxnorm_concept_name,
            rxnorm.rxnorm_ingredient_concept_name,
            atc5_to_atc3.atc5_concept_name,
            atc3_to_atc1.atc3_concept_name,
            atc1.concept_name as atc1_concept_name
        from   
        (
        -- map each RxNorm drug concept to its RxNorm ingredient
        select c1.concept_id,
            c1.concept_name,
            c2.concept_id as rxnorm_ingredient_concept_id,
            c2.concept_name as RxNorm_ingredient_concept_name
        from ohdsiv5.concept c1
            inner join ohdsiv5.concept_ancestor ca1
            on c1.concept_id = ca1.descendant_concept_id
            and c1.vocabulary_id = 'RxNorm'
            inner join ohdsiv5.concept c2
            on ca1.ancestor_concept_id = c2.concept_id
            and c2.vocabulary_id = 'RxNorm'
            and c2.concept_class_id = 'Ingredient'
        ) rxnorm
        left join
            -- pick one 'ATC 4th' ancestor per RxNorm ingredient (max(concept_id) breaks ties)
            (select c1.concept_id as rxnorm_ingredient_concept_id, max(c2.concept_id) as atc5_concept_id
            from
            ohdsiv5.concept c1
            inner join
            ohdsiv5.concept_ancestor ca1
            on c1.concept_id = ca1.descendant_concept_id
            and c1.vocabulary_id = 'RxNorm'
            and c1.concept_class_id = 'Ingredient'
            inner join
            ohdsiv5.concept c2
            on ca1.ancestor_concept_id = c2.concept_id
            and c2.vocabulary_id = 'ATC'
            and c2.concept_class_id = 'ATC 4th'
            group by c1.concept_id
            ) rxnorm_to_atc5
        on rxnorm.rxnorm_ingredient_concept_id = rxnorm_to_atc5.rxnorm_ingredient_concept_id

        left join
            -- pick one 'ATC 2nd' ancestor per 'ATC 4th' concept
            (select c1.concept_id as atc5_concept_id, c1.concept_name as atc5_concept_name, max(c2.concept_id) as atc3_concept_id
            from
            ohdsiv5.concept c1
            inner join
            ohdsiv5.concept_ancestor ca1
            on c1.concept_id = ca1.descendant_concept_id
            and c1.vocabulary_id = 'ATC'
            and c1.concept_class_id = 'ATC 4th'
            inner join
            ohdsiv5.concept c2
            on ca1.ancestor_concept_id = c2.concept_id
            and c2.vocabulary_id = 'ATC'
            and c2.concept_class_id = 'ATC 2nd'
            group by c1.concept_id, c1.concept_name
            ) atc5_to_atc3
        on rxnorm_to_atc5.atc5_concept_id = atc5_to_atc3.atc5_concept_id

        left join
            -- pick one 'ATC 1st' ancestor per 'ATC 2nd' concept
            (select c1.concept_id as atc3_concept_id, c1.concept_name as atc3_concept_name, max(c2.concept_id) as atc1_concept_id
            from
            ohdsiv5.concept c1
            inner join
            ohdsiv5.concept_ancestor ca1
            on c1.concept_id = ca1.descendant_concept_id
            and c1.vocabulary_id = 'ATC'
            and c1.concept_class_id = 'ATC 2nd'
            inner join
            ohdsiv5.concept c2
            on ca1.ancestor_concept_id = c2.concept_id
            and c2.vocabulary_id = 'ATC'
            and c2.concept_class_id = 'ATC 1st'
            group by c1.concept_id, c1.concept_name
            ) atc3_to_atc1
        on atc5_to_atc3.atc3_concept_id = atc3_to_atc1.atc3_concept_id

        left join ohdsiv5.concept atc1
         on atc3_to_atc1.atc1_concept_id = atc1.concept_id
    ) concept_hierarchy
    on ar1.stratum_1 = CAST(concept_hierarchy.concept_id AS VARCHAR),
    (select count_value from ACHILLES_results where analysis_id = 1) denom -- total person count (cross join)

order by ar1.count_value desc

Indexes are enabled and the database is tuned for performance. Is anybody else having the same issue?
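In case it helps anyone reproduce this: prefixing the statement with EXPLAIN makes Postgres return the planner's estimated plan without executing it, so it comes back even when the query itself hangs. A minimal sketch on one of the subqueries:

-- EXPLAIN returns the estimated plan without running the statement;
-- prepending it to the full SELECT above shows the whole plan
EXPLAIN
select *
from ACHILLES_results
where analysis_id = 700;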

I was also studying the Achilles.sql file, and it contains quite a lot of SQL queries. Perhaps you could skip the queries that don't produce a crucial interim table used later in the analysis; the process would still complete, just with some JSON data missing.

Having more comments in that large SQL file would be helpful. The numbering scheme for the queries/analyses could also be better explained.

I started by extracting all the comments to try to understand it (see my post about Achilles Heel).

Thanks for the input. I ended up removing the drug reports and it ran like a charm!

There is a nice reports argument that can be altered to select which reports to run.

someReports <- c("CONDITION","CONDITION_ERA", "DASHBOARD", "DATA_DENSITY", "DEATH","HEEL","OBSERVATION","OBSERVATION_PERIOD","PERSON","PROCEDURE","VISIT")

And then call the export with the following:

exportToJson(connectionDetails, cdmSchema = "ohdsiv5", resultsSchema = "results_schema", outputPath = "/home/jmbanda/AchillesExport2", cdmVersion = "5", report = someReports)

Sorry for the problems, Juan. I’m thinking that part of the work could be figuring out the drug hierarchy from the vocab. It’s all those atc1/3/5 subqueries that are being built and joined. If you are game, perhaps you’d like to take that query, strip out the ATC1/3/5 subquery parts so that the root select just selects the aggregates and not any of the concept hierarchy, and see if that runs any faster?
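Roughly, the stripped-down version would look something like this (an untested sketch that keeps only the aggregates):

select ar1.stratum_1 as drug_concept_id, -- stratum_1 holds the drug concept_id
    ar1.count_value as num_persons,
    ROUND(CAST(1.0*ar1.count_value / denom.count_value AS NUMERIC),5) as percent_persons,
    ROUND(CAST(1.0*ar2.count_value / ar1.count_value AS NUMERIC),5) as records_per_person
from (select * from ACHILLES_results where analysis_id = 700) ar1
    inner join
    (select * from ACHILLES_results where analysis_id = 701) ar2
    on ar1.stratum_1 = ar2.stratum_1,
    (select count_value from ACHILLES_results where analysis_id = 1) denom
order by ar1.count_value desc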

If it does, the solution could be indexes on vocabulary, or we just need to work on finding a new way to determine those hierarchies…

-Chris

Thanks for the suggestion @Chris_Knoll. I left the full export program running over the weekend(-ish) and it finally completed in 3.53 days. I will definitely try to strip out the subqueries the next time I run it. The export actually took longer than the analysis itself (2.8 days).

With all indexes loaded, some extra indexes here and there, and some minor fine-tuning, the scripts still took this long. Our current data export is around 400 GB of data (and indexes) in a Postgres database on a temporary server; we will move it to a faster, bigger machine soon, so that might be part of the issue. I will keep you posted if I strip out the code and let you know how much improvement we see.

Here are our execution time screen captures (Analysis and Export):

The same thing happened to me, and I have no idea why it got stuck. It is only a small cohort of 1 million patients, the database is only around 30 GB, and we have far more RAM than that (128 GB). It does not return anything after running for 3 days.

I extracted the relevant SQL by forcing R to stop, and ran the SQL through psycopg2 in Python. The result was returned in 6 minutes.

I have no idea at all what is wrong.

Is there any way to inspect locks on running transactions? It’s odd that if you run it raw, it finishes in 6 minutes, but through JDBC in R, it never finishes even after 3 days… to me, that smells like a deadlock. It would be interesting to look at the database statistics to see if there are any locks.

There weren’t any locks as I checked with pg_stat_activity.

When I ran the SQL with psycopg in Python, the query would sometimes show up on the monitoring graph as 5 different parts. Most of them would be in a ‘wait’ state on something like “IPC: MessageQueueSend”. I terminated those queries and tried to run it ‘single-threaded’ (not sure of the exact term).
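If those parts were parallel query workers (the IPC / MessageQueueSend wait suggests so), I believe the way to force a single process is the session setting below, though I have not confirmed this is what my monitoring tool did:

-- sketch: disable parallel query for the current session, forcing a single backend process
SET max_parallel_workers_per_gather = 0;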

Very oddly, the Achilles SQL for drug_era ran and finished today, within 20 minutes, and I didn’t fix anything at all (as far as I can tell). But the drug_exposure query has now been running for a few hours. drug_era was derived from drug_exposure, so I would expect drug_exposure to work as well. No idea why it isn’t working.

The database in question has around 200 million rows in drug_exposure. We have another database with 2 billion rows in drug_exposure, and that one still finishes in 6 hours or so. Really not sure what the problem is.

I suppose everything is okay on my side. There isn’t much data in the source tables, so there is not a lot to fill into the OMOP tables.

String concatenation is very expensive. This may not be your problem at all; as Chris mentioned, it may very well be a deadlock. But it is a code smell. Perhaps load the results into a CTE before doing all of the string manipulation?
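As a rough sketch of what I mean (reusing the concept_hierarchy subquery from the original query, unchanged):

with raw_results as (
    -- joins and aggregates only; no string work yet
    select ch.concept_id,
        ch.atc1_concept_name,
        ch.atc3_concept_name,
        ch.atc5_concept_name,
        ch.rxnorm_ingredient_concept_name,
        ch.rxnorm_concept_name,
        ar1.count_value as num_persons
    from (select * from ACHILLES_results where analysis_id = 700) ar1
        inner join ( /* the concept_hierarchy subquery from above, unchanged */ ) ch
        on ar1.stratum_1 = CAST(ch.concept_id AS VARCHAR)
)
-- concatenate only after the join results exist
select concept_id,
    COALESCE(atc1_concept_name,'NA') || '||' ||
    COALESCE(atc3_concept_name,'NA') || '||' ||
    COALESCE(atc5_concept_name,'NA') || '||' ||
    COALESCE(rxnorm_ingredient_concept_name,'NA') || '||' ||
    rxnorm_concept_name as concept_path,
    num_persons
from raw_results
order by num_persons desc;

(In Postgres before version 12, a CTE is an optimization fence, so the concatenation is guaranteed to run after the joined rows are produced.)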

I do not understand why string concatenation is expensive in SQL. I was never taught that; if I had to do it, I would do it in an ordinary programming language instead.

Anyway, I have kept the script going, and it is still running 20 hours later (on that very small 1-million-patient cohort and 30 GB database). I guess if it were that string concatenation issue, it would still be faster even in a ‘slow’ language like Python.

It would be helpful to see the query plan that the R application is producing. It is most likely not the same plan as the one created from Python. I am not an expert in R, far from it in fact, but most high-level languages (unless they fire up a pg console in the background) do not run the query as written; they create a direct request from the SQL they are given.
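If the driver cannot be made to show its plan, Postgres’s auto_explain module can log the plan of whatever the JDBC session actually executes. A sketch, assuming superuser access:

-- load the module for this session (or preload it server-wide via
-- session_preload_libraries so the R session picks it up too)
LOAD 'auto_explain';
-- log the plan of every statement, with actual row counts
SET auto_explain.log_min_duration = 0;
SET auto_explain.log_analyze = true;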

As to why string concatenation is expensive: SQL is optimized to work on sets. If the query plan performs the string concatenation after all the data has been retrieved, the expense is usually not big enough to affect the query overmuch. But if the concatenations are done in the middle of the query plan, they interrupt the set operations. This is why I suggested putting the results into a CTE and concatenating the strings afterwards. With the optimizer kept out of the CTE (it acts as a fence), this should speed up the query.

Speaking of query optimization, does the driver that R is using have an option to turn query optimization on or off?
