
Execution time of Achilles and PLP queries

Hi,

I’m setting up an OMOP CDM database and tools for the first time in our organization.
The configuration is as follows:

  • HW: 8 Xeon cores, 64GB RAM, 2GB SSD
  • OS: Windows Server 2019
  • DB: PostgreSQL 12
  • CDM: version 5.1
  • Data: IQVIA IMRD (~13M patients, largest table ~2.3 billion rows)

Following the detailed instructions, I’ve loaded the data, created constraints and indexes, and set up WebAPI & Atlas. When running Achilles to create the data statistics, I noticed that some of the queries take an extremely long time to complete, such that running the entire SQL script generated by Achilles may take a few days (!).

My questions are:

  1. What is the typical execution time of an initial Achilles run on a dataset of this size?

  2. Are the SQL queries generated by Achilles optimized for a specific dialect? Does the performance depend on the specific DBMS?

  3. For one query (analysis 11, Number of non-deceased persons by year of birth and by gender), a change in the syntax resulted in a large performance gain, which indicates that the query can be optimized for better performance:

The original (did not finish overnight):
SELECT
11 as analysis_id, CAST(year_of_birth AS VARCHAR(255)) as stratum_1,
CAST(gender_concept_id AS VARCHAR(255)) as stratum_2,
cast(null as varchar(255)) as stratum_3, cast(null as varchar(255)) as stratum_4, cast(null as varchar(255)) as stratum_5,
COUNT(distinct person_id) as count_value
FROM
imrd1903.person person
where person_id not in (select person_id from imrd1903.death)
GROUP BY year_of_birth, gender_concept_id

The modified version (completed in 25 seconds):
SELECT
11 as analysis_id, CAST(year_of_birth AS VARCHAR(255)) as stratum_1,
CAST(gender_concept_id AS VARCHAR(255)) as stratum_2,
cast(null as varchar(255)) as stratum_3, cast(null as varchar(255)) as stratum_4, cast(null as varchar(255)) as stratum_5,
COUNT(distinct person_id) as count_value
FROM
imrd1903.person person
WHERE NOT EXISTS
(
SELECT NULL
FROM imrd1903.death death
WHERE person.person_id = death.person_id
)
GROUP BY year_of_birth, gender_concept_id
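
A stripped-down sketch of how the two forms can be compared with EXPLAIN on my imrd1903 schema (this is not the exact Achilles SQL, and EXPLAIN ANALYZE actually executes the statement, so only use it on a variant that finishes):

-- inspect the actual plan and buffer usage of the NOT EXISTS variant
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(distinct person_id) as count_value
FROM imrd1903.person person
WHERE NOT EXISTS
(
SELECT NULL
FROM imrd1903.death death
WHERE person.person_id = death.person_id
);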

Any other input and tips for improving performance on the described system would be appreciated.

Thanks,
Guy.

Hi @guyamit,

  1. It depends on the database platform and what parallel processing is available. For 140 million patient lives on Redshift (running with 10 concurrent threads), we’ve seen it take about 2-3 hours. With Postgres, you may not have the same ability to gain efficiency through running Achilles with multiple threads as you would with a massively parallel processing (MPP) platform like Redshift, BigQuery, or MS PDW.

  2. The queries are not optimized for one specific platform; rather, they are written so that they can work against all OHDSI-supported platforms. We do have some optimization for Redshift and MS PDW by issuing a distribution key, but that’s it.

  3. I totally agree with the optimized version you suggested for that query. We need to embrace “not exists” logic in more places throughout the code, as it can be much more performant than “not in” or a left join with a check for nulls. Can you raise this as an issue in the GitHub repo? https://github.com/OHDSI/Achilles/issues

One idea to try to improve performance is to disable cost-based analyses. These are often the most expensive analyses in the execution, and many researchers don’t really need them as they aren’t interested in summarizing the cost data.

Thanks,
Ajit

Thanks @Ajit_Londhe for your answers.

It seems that the scope of the problem goes beyond the SQL queries generated by Achilles: when trying to run a PatientLevelPrediction study (from https://github.com/ohdsi-studies/Covid19PredictionStudies) I also encounter extremely long execution times, which make it impractical to run. For example, here’s a link to a query for generating a cohort:
https://drive.google.com/file/d/1OpLHsAMe0HD2eyh0TllCDT6QhIELggJB/view?usp=sharing
and here’s the result of running EXPLAIN on this query: https://drive.google.com/file/d/1sUfzGdJZu5vznvnnP98PedHfIqmMwH77/view?usp=sharing

As it isn’t practical to manually optimize these large rendered queries, I’m trying to figure out what slows them down. Assuming my hardware and database configuration are OK, the next suspects are the indexes:
Is it possible that my CDM is missing some indexes that are required for these queries to run?
I’ve generated all the indexes using the CDM script 'OMOP CDM indexes required - PostgreSQL.sql', but since my data is v5.1, maybe there are additional required indexes?
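
For reference, a quick way to list which indexes currently exist on the CDM schema (pg_indexes is a standard Postgres catalog view; imrd1903 is just my schema name):

-- list all indexes defined on tables in the CDM schema
SELECT tablename, indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'imrd1903'
ORDER BY tablename, indexname;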

Thanks,
Guy.

Hi @Ajit_Londhe, I see that support for Netezza and Google BigQuery has recently been added to Achilles. Have you tested their performance already? How does it compare to, e.g., Postgres? Thanks!

@MaximMoinat

Odysseus provides OHDSI support for both BigQuery and Netezza. These databases are very different from Postgres, so even comparing performance is not quite fair. Netezza is an analytics appliance, BigQuery is a columnar data warehouse built for analytics and autoscaling, and Postgres is an RDBMS. Postgres works fine for smaller data sets but will not work well for large data, while the other two scale well as data grows. For example, running Optum or Truven on Postgres would simply choke it. Both BigQuery and Netezza are high performance since they are built for analytics, but of course performance depends on the type of SQL you are running, due to the different types of usage these systems were built for. We regularly run Achilles and other analytics on BigQuery and it works well. We do not have native Netezza (a very expensive appliance) and internally use the Netezza emulator, but our customer does, and it is fast as well.

Hi @MaximMoinat - we don’t have access to a test Netezza or BQ environment, so I’ll defer to @gregk’s post

I took a look at the cohort query and the query plan, and noted the following:

It looks like the cohort definition is using visits as the index date, which could lead to selecting a large portion of the visit table (compared to condition_occurrence, where you are limiting to specific diagnoses). That may simply be a requirement of the cohort definition, but cohorts based on scanning the entire VISIT table are going to be expensive. Worse, PostgreSQL does not push predicates down into CTEs (prior to version 12, CTEs always act as optimization fences), so it basically materializes the primary_events CTE into memory (or to disk if you don’t have enough), which leads to a complete copy of the VISIT table in the event that your VISIT criteria would return most of it.
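
(A side note, since you’re on PostgreSQL 12: PG 12 can inline CTEs and lets you request that explicitly with NOT MATERIALIZED. The sketch below is a hand-edited illustration against a made-up primary_events CTE, not something the generated cohort SQL will contain, so treat it purely as an experiment.)

-- ask the planner to inline the CTE so predicates can reach the base table
WITH primary_events AS NOT MATERIALIZED (
  SELECT person_id, visit_occurrence_id, visit_start_date
  FROM imrd1903.visit_occurrence
)
SELECT count(*)
FROM primary_events pe
WHERE pe.visit_start_date >= DATE '2020-01-01';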

On the query plan: I see a few instances where it’s sorting on person_id, visit_start_date. In other cases I see it scanning an IDX_ index, but that isn’t happening at the root of the query. So I’d investigate your indexes.

On SQL Server, the way we index all of the patient-level records is by:
cluster index on person_id, {concept_id}, {_start_date}
non-cluster index on concept_id, _start_date

I’m pretty sure PostgreSQL doesn’t support the notion of clustered vs. non-clustered indexes, but the idea is that we cluster-index on person because that’s typically how you join to the domain tables (looking for a person’s conditions, procedures, etc.). But we also put an index on concept_id because we also usually search for specific conditions across all people. In both cases we include the date of the observation in the index because we’re usually looking for a given concept within a specified time window.
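
As a rough Postgres translation of that scheme, using condition_occurrence as an example (the index names are made up, the imrd1903 schema comes from the queries above, and Postgres’s CLUSTER is only a one-time physical reorder rather than a maintained clustered index):

-- person-first composite index, analogous to the "cluster" index above
CREATE INDEX idx_condition_person_concept_date
  ON imrd1903.condition_occurrence (person_id, condition_concept_id, condition_start_date);

-- concept-first index for searching specific conditions across all people
CREATE INDEX idx_condition_concept_date
  ON imrd1903.condition_occurrence (condition_concept_id, condition_start_date);

-- optionally reorder the table on the person-first index (one-time operation, takes a lock)
CLUSTER imrd1903.condition_occurrence USING idx_condition_person_concept_date;

-- refresh planner statistics afterwards
ANALYZE imrd1903.condition_occurrence;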

I would experiment in your own environment and try to understand how applying indexes influences the execution plan (try to minimize full table scans, or plans that start from very large tables). I agree with your sentiment that it isn’t practical to manually optimize the rendered SQL, especially since an optimization for one platform may be counter-productive on another, so it’s best to work off the generated SQL as a ‘common’ baseline, and then optimize indexes and DB configuration based on your specific data geometry (i.e., maybe partitions would make sense).


Sorry, I missed the original question from @guyamit

Postgres does OK for datasets up to the 5-7 million patient range, but begins to gradually lose performance as data size grows. The IQVIA IMRD dataset is almost 14 million patients, so it is beginning to push Postgres’s limits. At some point we tried loading Optum (~70 million) into a beefed-up server running Postgres to benchmark and test it, and it was just choking. It is just not built for that use; it is an RDBMS.

If you are working with Postgres, I would definitely look into fine-tuning the database itself first, e.g. a larger work_mem. You already have an SSD, which is good. Then ensure indexes are created and statistics are computed.
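
For illustration only, the kind of thing I mean (the values here are arbitrary starting points rather than recommendations, and the imrd1903 schema name is taken from your queries above):

-- try more memory for sorts/hash joins in the current session
SET work_mem = '512MB';

-- or persist settings server-wide
ALTER SYSTEM SET work_mem = '256MB';
SELECT pg_reload_conf();
-- shared_buffers requires a server restart after changing it
ALTER SYSTEM SET shared_buffers = '16GB';

-- make sure the planner has fresh statistics on the big CDM tables
ANALYZE imrd1903.visit_occurrence;
ANALYZE imrd1903.condition_occurrence;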

But if you do have the option to move to another database, I would, as Ajit suggested, look into one of the MPP platforms as your first choice (AWS Redshift or Google BigQuery).

Last year, we spent quite a bit of time optimizing the Redshift DDL so that it would take advantage of Redshift’s data distribution across nodes (or copies on each node), as well as sort ordering, to better organize data for OHDSI use cases. It really did pay off in the end.

Starting with Postgres 11, Postgres introduced support for more sophisticated partitioning of data, e.g. hash partitioning in addition to range. I am still a bit skeptical of how much more we could squeeze out of Postgres, but it could be an interesting exercise to see whether some tables can be partitioned and whether that would improve performance for some of the queries.
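
To make the idea concrete, a minimal sketch of hash-partitioning a CDM table by person_id in Postgres 11+ (the cdm schema name is a placeholder, the column list is abbreviated, the partition count is arbitrary, and I have not benchmarked this):

-- parent table, hash-partitioned on person_id (only a few CDM columns shown)
CREATE TABLE cdm.condition_occurrence_part (
  condition_occurrence_id  bigint  NOT NULL,
  person_id                bigint  NOT NULL,
  condition_concept_id     integer NOT NULL,
  condition_start_date     date    NOT NULL
  -- remaining CDM columns omitted for brevity
) PARTITION BY HASH (person_id);

-- four hash partitions
CREATE TABLE cdm.condition_occurrence_p0 PARTITION OF cdm.condition_occurrence_part FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE cdm.condition_occurrence_p1 PARTITION OF cdm.condition_occurrence_part FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE cdm.condition_occurrence_p2 PARTITION OF cdm.condition_occurrence_part FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE cdm.condition_occurrence_p3 PARTITION OF cdm.condition_occurrence_part FOR VALUES WITH (MODULUS 4, REMAINDER 3);

-- load the data, then index; an index created on the parent cascades to the partitions in PG 11+
INSERT INTO cdm.condition_occurrence_part
  SELECT condition_occurrence_id, person_id, condition_concept_id, condition_start_date
  FROM cdm.condition_occurrence;
CREATE INDEX ON cdm.condition_occurrence_part (person_id, condition_start_date);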

Redshift can give you a great ROI on performance if you know how to tune it and your concurrency isn’t too high. Really nailing distribution and sort keys is critical. Also, you should know something about the Workload Manager, though it’s tricky.

I’ve been using Redshift since 2013, but starting around 2015 Snowflake became my preferred warehousing platform. It is superior to Redshift in almost every aspect.

I would have to disagree with you on dataset-size performance with Postgres. I have built systems where we were inserting 7 million rows per day into a single dataset and it still performed wonderfully. It does take more knowledge to set up Postgres than it does for systems like SQL Server, but from what I have used both systems for, optimized Postgres is much better at handling large datasets.

As far as BigQuery goes, yes, that should be faster; after all, it is not an RDBMS, and I am not arguing against that. Postgres should run as well as an RDBMS can, if one has it optimized correctly for reads. If one does not have someone on staff to optimize Postgres, EnterpriseDB sells some pretty nice tools, and not for prices that will break most budgets.

As for the OP’s original query, there are many useless casts that can be removed to speed up the query on the system as-is. Get your data first, then do the casts.

You already figured out how powerful EXISTS is; use it where you can.

I am not familiar with the IQVIA IMRD data, but where you are using CAST on dates, use date_part or extract if you can.
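
For example, something along these lines (using the imrd1903 schema name from the queries above, purely to illustrate the pattern):

-- pull the year out of the date directly instead of casting to text
SELECT extract(year FROM condition_start_date) AS start_year, count(*)
FROM imrd1903.condition_occurrence
GROUP BY 1;

-- date_part is the equivalent Postgres function
SELECT date_part('year', condition_start_date) AS start_year, count(*)
FROM imrd1903.condition_occurrence
GROUP BY 1;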

As for indices, those are tricky; one can slow the system down by adding more. Testing is always king here.
One general rule is to have the indices built on a different physical hard drive/SSD than the data tables themselves.


@Mark, I am not sure you are disagreeing with me :slight_smile: In your example, it sounds like Postgres was used for transactional use cases, which is what RDBMSs are built for. What Postgres is not good at is large-scale analytics on large datasets where you have to scan through enormous amounts of data. It was simply not built to be a data warehouse. Everything has its purpose and is optimized for that. I am sure Postgres is great for typical transactional use cases.

And yes, I am sure we can do a lot more to optimize our code too :slight_smile: not everything can be blamed on the RDBMS.

It was a live engineering data warehouse (I understand that a live data warehouse is almost a contradiction in terms; I did not come up with that idea, I just had to make it a reality). Because some of the engineering analyses commonly rolled through 6 billion rows out of a larger partitioned table, with joins afterwards, we were very read-optimized. It was only transactional in the sense that if a batch load failed, it would roll back to a safe state.

My comments are directed at the fact that the OP gave us some givens of what he has to work with, so while using something other than an RDBMS would be ideal, I think that with the correct optimization of Postgres he should be able to run the queries. I am agreeing with you that, if starting from scratch, your examples would be a better solution.

Very interesting, Mark. Maybe an opportunity for us to learn from your experience. I have looked at the OMOP CDM and the partitioning types that were available before hash partitioning was introduced, and didn’t see any obvious way to partition the data usefully at the time. But I would be interested in thinking about that again together?

Sure. I will have to do it after work hours but I am game.

We were running similar things on PostgreSQL with 128GB RAM, but a much smaller patient count of 1 million.

Achilles has been stuck on “Generating drug era treemap” for a few days. I am not sure why.

Relevant SQL

select concept_hierarchy.rxnorm_ingredient_concept_id concept_id, 
	concat(COALESCE(concept_hierarchy.atc1_concept_name,'NA'), '||',
	COALESCE(concept_hierarchy.atc3_concept_name,'NA'), '||',
	COALESCE(concept_hierarchy.atc5_concept_name,'NA'), '||',
	COALESCE(concept_hierarchy.rxnorm_ingredient_concept_name,'||')) as concept_path,
	ar1.count_value as num_persons, 
	1.0*ar1.count_value / denom.count_value as percent_persons,
	ar2.avg_value as length_of_era
from (select cast(stratum_1 as int) stratum_1, count_value from test_achilles_ha.achilles_results where analysis_id = 900 GROUP BY analysis_id, stratum_1, count_value) ar1
	inner join
	(select cast(stratum_1 as int) stratum_1, avg_value from test_achilles_ha.achilles_results_dist where analysis_id = 907 GROUP BY analysis_id, stratum_1, avg_value) ar2
	on ar1.stratum_1 = ar2.stratum_1
	inner join
	(
  	select rxnorm.rxnorm_ingredient_concept_id,
			rxnorm.rxnorm_ingredient_concept_name, 
			atc5_to_atc3.atc5_concept_name,
			atc3_to_atc1.atc3_concept_name,
			atc1.concept_name as atc1_concept_name
		from	
		(
		select c2.concept_id as rxnorm_ingredient_concept_id, 
			c2.concept_name as RxNorm_ingredient_concept_name
		from 
			ohdsiv5_july2021_ha.concept c2
			where
			c2.domain_id = 'Drug'
			and c2.concept_class_id in ('Ingredient','CVX')
		) rxnorm
		left join
			(select c1.concept_id as rxnorm_ingredient_concept_id, max(c2.concept_id) as atc5_concept_id
			from
			ohdsiv5_july2021_ha.concept c1
			inner join 
			ohdsiv5_july2021_ha.concept_ancestor ca1
			on c1.concept_id = ca1.descendant_concept_id
			and c1.domain_id = 'Drug'
			and c1.concept_class_id = 'Ingredient'
			inner join 
			ohdsiv5_july2021_ha.concept c2
			on ca1.ancestor_concept_id = c2.concept_id
			and c2.vocabulary_id = 'ATC'
			and c2.concept_class_id = 'ATC 4th'
			group by c1.concept_id
			) rxnorm_to_atc5
		on rxnorm.rxnorm_ingredient_concept_id = rxnorm_to_atc5.rxnorm_ingredient_concept_id

		left join
			(select c1.concept_id as atc5_concept_id, c1.concept_name as atc5_concept_name, max(c2.concept_id) as atc3_concept_id
			from
			ohdsiv5_july2021_ha.concept c1
			inner join 
			ohdsiv5_july2021_ha.concept_ancestor ca1
			on c1.concept_id = ca1.descendant_concept_id
			and c1.vocabulary_id = 'ATC'
			and c1.concept_class_id = 'ATC 4th'
			inner join 
			ohdsiv5_july2021_ha.concept c2
			on ca1.ancestor_concept_id = c2.concept_id
			and c2.vocabulary_id = 'ATC'
			and c2.concept_class_id = 'ATC 2nd'
			group by c1.concept_id, c1.concept_name
			) atc5_to_atc3
		on rxnorm_to_atc5.atc5_concept_id = atc5_to_atc3.atc5_concept_id

		left join
			(select c1.concept_id as atc3_concept_id, c1.concept_name as atc3_concept_name, max(c2.concept_id) as atc1_concept_id
			from
			ohdsiv5_july2021_ha.concept c1
			inner join 
			ohdsiv5_july2021_ha.concept_ancestor ca1
			on c1.concept_id = ca1.descendant_concept_id
			and c1.vocabulary_id = 'ATC'
			and c1.concept_class_id = 'ATC 2nd'
			inner join 
			ohdsiv5_july2021_ha.concept c2
			on ca1.ancestor_concept_id = c2.concept_id
			and c2.vocabulary_id = 'ATC'
  		and c2.concept_class_id = 'ATC 1st'
			group by c1.concept_id, c1.concept_name
			) atc3_to_atc1
		on atc5_to_atc3.atc3_concept_id = atc3_to_atc1.atc3_concept_id

		left join ohdsiv5_july2021_ha.concept atc1
		 on atc3_to_atc1.atc1_concept_id = atc1.concept_id
	) concept_hierarchy
	on ar1.stratum_1 = concept_hierarchy.rxnorm_ingredient_concept_id
	, (select count_value from test_achilles_ha.achilles_results where analysis_id = 1) denom

Edit: it turns out that if I run it from Python with psycopg2 it finishes in 6 minutes, but from the Achilles R package it does not return anything. I am not sure what is wrong. Le sigh.

BTW, just replying to the original poster: we have another database of 16M patients on PostgreSQL (128GB RAM, 8 CPUs) that runs well. Achilles took 24 hours to finish the whole thing.
