Long running DDL script for drugs to create hierarchy w/ ingredients and ATC

dramacloak · January 23, 2019, 6:32pm

The code below has been running for 5+ days now to try to setup the drug table, as generated by the DDL script in Atlas. It’s still not finished. Is this run time normal for this query?

I’m running postgresql. Is there any way to optimize the query or optimize Postgresql so that this query runs in a reasonable amount of time (such as adjusting parameters)? Is there another way to get the table so that it doesn’t need to be generated on each setup?

/********** DRUG **********/
INSERT INTO results.concept_hierarchy
(concept_id, concept_name, treemap, level1_concept_name, level2_concept_name, level3_concept_name, level4_concept_name)
SELECT
rxnorm.concept_id,
rxnorm.concept_name AS rxnorm_concept_name,
‘Drug’ AS treemap,
rxnorm.rxnorm_ingredient_concept_name,
atc5_to_atc3.atc5_concept_name,
atc3_to_atc1.atc3_concept_name,
atc1.concept_name AS atc1_concept_name
FROM (
SELECT
c1.concept_id,
c1.concept_name,
c2.concept_id AS rxnorm_ingredient_concept_id,
c2.concept_name AS RxNorm_ingredient_concept_name
FROM vocab.concept c1
INNER JOIN vocab.concept_ancestor ca1 ON c1.concept_id = ca1.descendant_concept_id
AND c1.domain_id = ‘Drug’
INNER JOIN vocab.concept c2 ON ca1.ancestor_concept_id = c2.concept_id
AND c2.domain_id = ‘Drug’
AND c2.concept_class_id = ‘Ingredient’
) rxnorm
LEFT JOIN (
SELECT
c1.concept_id AS rxnorm_ingredient_concept_id,
max(c2.concept_id) AS atc5_concept_id
FROM vocab.concept c1
INNER JOIN vocab.concept_ancestor ca1 ON c1.concept_id = ca1.descendant_concept_id
AND c1.domain_id = ‘Drug’
AND c1.concept_class_id = ‘Ingredient’
INNER JOIN vocab.concept c2 ON ca1.ancestor_concept_id = c2.concept_id
AND c2.vocabulary_id = ‘ATC’
AND c2.concept_class_id = ‘ATC 4th’
GROUP BY c1.concept_id
) rxnorm_to_atc5 ON rxnorm.rxnorm_ingredient_concept_id = rxnorm_to_atc5.rxnorm_ingredient_concept_id
LEFT JOIN (
SELECT
c1.concept_id AS atc5_concept_id,
c1.concept_name AS atc5_concept_name,
max(c2.concept_id) AS atc3_concept_id
FROM vocab.concept c1
INNER JOIN vocab.concept_ancestor ca1 ON c1.concept_id = ca1.descendant_concept_id
AND c1.vocabulary_id = ‘ATC’
AND c1.concept_class_id = ‘ATC 4th’
INNER JOIN vocab.concept c2 ON ca1.ancestor_concept_id = c2.concept_id
AND c2.vocabulary_id = ‘ATC’
AND c2.concept_class_id = ‘ATC 2nd’
GROUP BY c1.concept_id, c1.concept_name
) atc5_to_atc3 ON rxnorm_to_atc5.atc5_concept_id = atc5_to_atc3.atc5_concept_id
LEFT JOIN (
SELECT
c1.concept_id AS atc3_concept_id,
c1.concept_name AS atc3_concept_name,
max(c2.concept_id) AS atc1_concept_id
FROM vocab.concept c1
INNER JOIN vocab.concept_ancestor ca1 ON c1.concept_id = ca1.descendant_concept_id
AND c1.vocabulary_id = ‘ATC’
AND c1.concept_class_id = ‘ATC 2nd’
INNER JOIN vocab.concept c2 ON ca1.ancestor_concept_id = c2.concept_id
AND c2.vocabulary_id = ‘ATC’
AND c2.concept_class_id = ‘ATC 1st’
GROUP BY c1.concept_id, c1.concept_name
) atc3_to_atc1 ON atc5_to_atc3.atc3_concept_id = atc3_to_atc1.atc3_concept_id
LEFT JOIN vocab.concept atc1 ON atc3_to_atc1.atc1_concept_id = atc1.concept_id;

jhajagos · January 24, 2019, 8:42pm

Query optimization can be tricky but here are a couple of tips. I would first run explain to get the query plan (https://www.postgresql.org/docs/11/sql-explain.html). In joins you want to see that the indexes are being used and not a full table scan. I would try to execute each of the sub queries to make sure that they are executing correctly. Newer version of PostGreSQL support parallel execution of queries: https://www.postgresql.org/docs/current/how-parallel-query-works.html which can optimize the query run times.

My main recommendation is to break the query down, make sure you are not getting unexpected results (duplicate rows), and make sure you have correctly defined indices.

dramacloak · January 24, 2019, 11:06pm

It’d be hard for me to optimize this query without knowing a bit more about it’s purpose and what it’s supposed to look like, as I’m still trying to get Atlas to be able to generate a cohort for the first time. I’m not even really sure why it’s needed in Atlas or for cohort generation for that matter, or why the DDL script even exists. It’s been running over 6 days now with no end in sight. I haven’t really been able to find any documentation on the DDL script, so I’m unsure if I’ve missed some important steps that would make this run faster (like indexes). Thanks.

Chris_Knoll · January 25, 2019, 3:44am

What the script is doing is pre-caching a rollup of individual drug concepts to An ATC 5 and ATC 3 class such that the cached structure looks like:
concept_id, level1_concept_name, level2_concept_name, level3_concept_name, level4_concept_name

This is used in our Datasources reporting section of Atlas. It is not needed to generate cohorts.

I’d suggest you look into other environmental problems in your server. Creating indexes on empty tables should be fast, and the concept_hierarchy table should not take that long to materialize.

dramacloak · January 26, 2019, 10:28pm

To address this, I tuned some of Postgresql parameters using https://pgtune.leopard.in.ua/#/

I ran the following SQL code to create indices to speed up the operation:
CREATE INDEX concept_id ON vocab.concept(concept_id);
CREATE INDEX descendant_concept_id ON vocab.concept_ancestor(descendant_concept_id);

I then ran each piece of the query separately, which all ran in < 5 minutes each and worked on joining them together piece by piece. Interestingly, the select command with everything now runs in 5:51 and full command executes in 1:18 minutes with the insert… odd. Recommend adding the vocabulary index’s to the DDL script.

Chris_Knoll · January 27, 2019, 2:15am

Sort of like the ones identified here?

github.com

OHDSI/CommonDataModel/blob/master/PostgreSQL/OMOP CDM postgresql pk indexes.txt

/*********************************************************************************
# Copyright 2014 Observational Health Data Sciences and Informatics
#
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
********************************************************************************/

/************************

 ####### #     # ####### ######      #####  ######  #     #            #####        ###      ######  #    #      ##       ###

This file has been truncated. show original