
Execution time of Achilles and PLP queries


(Guy Amit) #1

Hi,

I’m setting up an OMOP CDM database & tools for the first time in our organization.
The configuration is as follows:

  • HW: 8 Xeon cores, 64 GB RAM, 2 TB SSD
  • OS: Windows Server 2019
  • DB: PostgreSQL 12
  • CDM: version 5.1
  • Data: IQVIA IMRD (~13M patients, largest table ~2.3G rows)

Following the detailed instructions, I’ve loaded the data, created constraints and indexes, and set up WebAPI & Atlas. When running Achilles to create the data statistics, I noticed that some of the queries take an extremely long time to complete, so running the entire SQL script generated by Achilles may take a few days (!).

My questions are:

  1. What is the typical execution time of an initial Achilles run on a dataset of this size?

  2. Are the SQL queries generated by Achilles optimized for a specific dialect? Does the performance depend on the specific DBMS?

  3. For one query (analysis 11, number of non-deceased persons by year of birth and by gender), a change in the syntax resulted in a large performance gain, which indicates that the query can be optimized for better performance:

The original (did not finish overnight):
SELECT
11 as analysis_id, CAST(year_of_birth AS VARCHAR(255)) as stratum_1,
CAST(gender_concept_id AS VARCHAR(255)) as stratum_2,
cast(null as varchar(255)) as stratum_3, cast(null as varchar(255)) as stratum_4, cast(null as varchar(255)) as stratum_5,
COUNT(distinct person_id) as count_value
FROM
imrd1903.person person
where person_id not in (select person_id from imrd1903.death)
GROUP BY year_of_birth, gender_concept_id

The modified version (completed in 25 seconds):
SELECT
11 as analysis_id, CAST(year_of_birth AS VARCHAR(255)) as stratum_1,
CAST(gender_concept_id AS VARCHAR(255)) as stratum_2,
cast(null as varchar(255)) as stratum_3, cast(null as varchar(255)) as stratum_4, cast(null as varchar(255)) as stratum_5,
COUNT(distinct person_id) as count_value
FROM
imrd1903.person person
WHERE NOT EXISTS
(
SELECT NULL
FROM imrd1903.death death
WHERE person.person_id = death.person_id
)
GROUP BY year_of_birth, gender_concept_id

Any other input and tips for improving performance on the described system would be appreciated.

Thanks,
Guy.


(Ajit Londhe) #2

Hi @guyamit,

  1. It depends on the database platform and what parallel processing is available. For 140 million patient lives on Redshift (running with 10 concurrent threads), we’ve seen it take about 2-3 hours. With Postgres, you may not have the same ability to gain efficiency through running Achilles with multiple threads as you would with a massively parallel processing (MPP) platform like Redshift, BigQuery, or MS PDW.

  2. The queries are not optimized for one specific platform; rather, they are written so that they can work against all OHDSI-supported platforms. We do have some optimization for Redshift and MS PDW by issuing a distribution key, but that’s it.

  3. I totally agree with the optimized version you suggested for that query. We need to embrace “not exists” logic in more places throughout the code, as it can be much more performant than “not in” or a left join with a check for nulls. Can you raise this as an issue in the GitHub repo? https://github.com/OHDSI/Achilles/issues

One idea to try to improve performance is to disable cost-based analyses. These are often the most expensive analyses in the execution, and many researchers don’t really need them as they aren’t interested in summarizing the cost data.

Thanks,
Ajit


(Guy Amit) #3

Thanks @Ajit_Londhe for your answers.

It seems that the scope of the problem extends beyond the SQL queries generated by Achilles: when trying to run a PatientLevelPrediction study (from https://github.com/ohdsi-studies/Covid19PredictionStudies), I also encounter extremely long execution times, which makes it impractical to run. For example, here’s a link to a query for generating a cohort:
https://drive.google.com/file/d/1OpLHsAMe0HD2eyh0TllCDT6QhIELggJB/view?usp=sharing
and here’s the result of running EXPLAIN on this query: https://drive.google.com/file/d/1sUfzGdJZu5vznvnnP98PedHfIqmMwH77/view?usp=sharing

As it isn’t practical to manually optimize these large rendered queries, I’m trying to figure out what slows them down. Assuming my hardware and database configuration are OK, the next suspects are the indexes.
Is it possible that my CDM is missing some indexes that are required for these queries to run?
I’ve generated all the indexes using the CDM script 'OMOP CDM indexes required - PostgreSQL.sql', but since my data is v5.1, maybe there are additional required indexes?
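As a sanity check (a sketch, assuming the CDM sits in the imrd1903 schema), this is the kind of query I can run to list which indexes currently exist on the CDM tables:

-- List every index defined in the CDM schema, with its definition:
SELECT tablename, indexname, indexdef
FROM pg_indexes
WHERE schemaname = 'imrd1903'
ORDER BY tablename, indexname;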

Thanks,
Guy.


(Maxim Moinat) #5

Hi @Ajit_Londhe, I see that recently support for Netezza and Google BQ has been added to Achilles. Have you tested their performance already? How does that compare to e.g. Postgres? Thanks!


(Gregory Klebanov) #6

@MaximMoinat

Odysseus provides OHDSI support for both BigQuery and Netezza. These databases are very different from Postgres, so even comparing their performance is not really fair. Netezza is an analytics appliance, BigQuery is a columnar data warehouse built for analytics and autoscaling, and Postgres is an RDBMS. Postgres works fine for smaller data sets but will not work well for large data, while the other two scale well as data grows. For example, running Optum or Truven on Postgres would simply choke it. Both BigQuery and Netezza are high performance, since they are built for analytics, but of course performance depends on the type of SQL you are running, given the kinds of usage these platforms were built for. We regularly run Achilles and other analytics on BigQuery and it works well. We do not have a native Netezza (it is a very expensive appliance) and internally use the Netezza emulator, but our customer does, and it is fast as well.


(Ajit Londhe) #7

Hi @MaximMoinat - we don’t have access to a test Netezza or BQ environment, so I’ll defer to @gregk’s post


(Chris Knoll) #8

I took a look at the cohort query and the query plan, and noted the following:

Looks like the cohort definition is using visits as the index date, which could lead to selecting a large portion of the visit table (compared to condition_occurrence, where you limit to specific diagnoses). That may simply be a requirement of the cohort definition, but cohorts based on scanning the entire VISIT table are going to be expensive. Worse, PostgreSQL does not push predicates down into CTEs, so it basically materializes the primary_events CTE into memory (or onto disk if memory runs out), which leads to a complete copy of the VISIT table (in the event that your VISIT criteria return most of the VISIT table).
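One thing worth checking on your install: since you are on PostgreSQL 12, CTEs that are referenced once are inlined by default, and you can force either behavior to see whether CTE materialization is what is hurting you. A sketch only, with a made-up CTE rather than the actual generated SQL:

-- Force inlining (PostgreSQL 12+), so the outer WHERE can be pushed into the CTE:
WITH primary_events AS NOT MATERIALIZED (
  SELECT person_id, visit_start_date
  FROM imrd1903.visit_occurrence
)
SELECT count(*)
FROM primary_events
WHERE visit_start_date >= DATE '2020-01-01';
-- Swap NOT MATERIALIZED for MATERIALIZED to reproduce the pre-12 behavior and compare plans.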

On the query plan: I see a few instances where it’s sorting on person_id, visit_start_date. In other cases I see it scanning an IDX_ index, but that’s not happening at the root of the query. So, I’d investigate your indexes.

On SQL Server, the way we index all of the patient-level records is:
cluster index on person_id, {concept_id}, {_start_date}
non-cluster index on concept_id, _start_date

I’m pretty sure PostgreSQL doesn’t support the notion of clustered vs. non-clustered indexes, but the idea is that we cluster the index on person_id because that’s typically how you join to the domain tables (looking for a person’s conditions, procedures, etc.). But we also put an index on concept_id because we usually also search for specific conditions across all people. In both cases we include the date of the observation in the index, because we’re usually looking for a given concept within a specified time window.
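A rough PostgreSQL approximation of that pattern might look like this (a sketch only, using condition_occurrence as the example; adjust table and column names for the other domain tables):

-- Composite index for person-oriented joins (a person's conditions within a window):
CREATE INDEX idx_condition_person_concept_date
  ON imrd1903.condition_occurrence (person_id, condition_concept_id, condition_start_date);

-- Secondary index for concept-oriented searches (one condition across all people):
CREATE INDEX idx_condition_concept_date
  ON imrd1903.condition_occurrence (condition_concept_id, condition_start_date);

-- CLUSTER physically reorders the table along one index; in PostgreSQL it is a
-- one-time operation and is not maintained automatically as new rows arrive.
CLUSTER imrd1903.condition_occurrence USING idx_condition_person_concept_date;
ANALYZE imrd1903.condition_occurrence;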

I would experiment in your own environment and try to understand how applying indexes influences the execution plan (try to minimize full table scans or sorts over very large tables). I agree with your sentiment that it isn’t practical to manually optimize the rendered SQL, especially since an optimization on one platform may be counter-productive on another, so it’s best to work off the generated SQL as a ‘common’ baseline and then optimize indexes and DB config based on your specific data geometry (i.e., maybe partitions would make sense).
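For that kind of experimentation, running EXPLAIN (ANALYZE, BUFFERS) on small pieces of the rendered SQL before and after adding an index is usually the quickest feedback loop. The fragment below is purely illustrative, not taken from the study SQL:

EXPLAIN (ANALYZE, BUFFERS)
SELECT co.person_id, co.condition_start_date
FROM imrd1903.condition_occurrence co
WHERE co.condition_concept_id = 12345              -- placeholder concept id of interest
  AND co.condition_start_date >= DATE '2020-01-01';
-- Look for "Seq Scan" on large tables and large "Sort" nodes in the output;
-- those are the spots where an index (or more work_mem) tends to help.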


(Gregory Klebanov) #9

Sorry, I missed the original question from @guyamit

Postgres does OK for datasets up to the 5-7 million patient range but gradually loses performance as the data size grows. The IQVIA IMRD dataset is almost 14 million patients, so it is beginning to push Postgres’ limits. At some point we tried loading Optum (~70 million) into a beefed-up server running Postgres to benchmark and test it, and it was just choking. It is simply not built for that use; it is an RDBMS.

If you are working with Postgres, I would definitely look into fine-tuning the database itself first, e.g. a larger work_mem. You already have an SSD, which is good. Then ensure indexes are created and stats are computed.
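Illustrative starting points only (a sketch; the right values depend on your workload and on how much of the 64 GB of RAM you can dedicate to Postgres):

ALTER SYSTEM SET shared_buffers = '16GB';             -- requires a server restart to take effect
ALTER SYSTEM SET effective_cache_size = '48GB';
ALTER SYSTEM SET work_mem = '256MB';                  -- applies per sort/hash node, so mind concurrency
ALTER SYSTEM SET maintenance_work_mem = '2GB';        -- speeds up index builds and VACUUM
ALTER SYSTEM SET max_parallel_workers_per_gather = 4;
SELECT pg_reload_conf();                              -- picks up the settings that don't need a restart
ANALYZE;                                              -- refresh planner statistics on the CDM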

But if you do have the option to move to another database, I would, as Ajit suggested, look into one of the MPP platforms as your first choice (AWS Redshift or Google BigQuery).


(Gregory Klebanov) #10

Last year, we spent quite a bit of time optimizing the Redshift DDL so that it would take advantage of Redshift’s data distribution across nodes (or copies on each node) as well as sort ordering, to better organize the data for OHDSI use cases. It really did pay off in the end.

Starting with Postgres 11 or 12 (I don’t remember which), Postgres introduced support for more sophisticated partitioning of data, e.g. hash partitioning in addition to range partitioning. I am still a bit skeptical about how much more we could squeeze out of Postgres, but it could be an interesting exercise to see whether partitioning some tables improves performance for some of the queries; a sketch is below.
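For anyone who wants to try it, a minimal sketch of hash-partitioning one CDM table by person_id on Postgres 11+ (table names and the partition count are illustrative, and the data has to be reloaded into the partitioned copy):

-- Create an empty partitioned copy with the same columns:
CREATE TABLE imrd1903.condition_occurrence_part (
    LIKE imrd1903.condition_occurrence
) PARTITION BY HASH (person_id);

-- Create the individual partitions:
CREATE TABLE imrd1903.condition_occurrence_p0
  PARTITION OF imrd1903.condition_occurrence_part
  FOR VALUES WITH (MODULUS 8, REMAINDER 0);
-- ...repeat for remainders 1 through 7...

-- Reload the data into the partitioned table:
INSERT INTO imrd1903.condition_occurrence_part
SELECT * FROM imrd1903.condition_occurrence;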


(Gordon Wong) #11

Redshift can give you great ROI on performance if you know how to tune it and your concurrency isn’t too high. Really nailing distribution and sort keys is critical. Also, you should know something about the Workload Manager, though it’s tricky.

I’ve been using Redshift since 2013, but starting around 2015 Snowflake became my preferred warehousing platform. It is superior to Redshift in almost every respect.


(Mark Seal) #12

I would have to disagree with you on dataset-size performance with Postgres. I have built systems where we were inserting 7 million rows per day into a single dataset and it still performed wonderfully. It does take more knowledge to set up Postgres than it does for systems like SQL Server, but from what I have used both systems for, well-optimized Postgres is much better at handling large datasets.

As far as BigQuery goes, yes, that should be faster; after all, it is not an RDBMS, and I am not arguing against that. Postgres should run as well as an RDBMS can if it is optimized correctly for reads. If you do not have someone on staff to optimize Postgres, EnterpriseDB sells some pretty nice tools at prices that will not break most budgets.

As for the OP’s original query, there are many unnecessary casts that can be removed to speed up the query on the system as it is. Get your data first, then do the casts.

You already figured out how powerful EXISTS is; use it where you can.

I am not familiar with the IQVIA IMRD data, but where you are using CAST on dates, use date_part or extract if you can.
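For instance (a hypothetical fragment, not taken from the study SQL):

-- Instead of casting a date to text and matching on the string, e.g.:
--   WHERE CAST(condition_start_date AS VARCHAR(255)) LIKE '2020%'
-- extract the part you actually need and compare it directly:
SELECT person_id
FROM imrd1903.condition_occurrence
WHERE EXTRACT(YEAR FROM condition_start_date) = 2020;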

As for indices, those are tricky; one can slow the system down by adding more. Testing is always king here.
One general rule is to have the indices built on a different physical hard drive/SSD than the data tables themselves.
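In Postgres that is done with tablespaces; a minimal sketch (the drive letter and paths are placeholders for a directory on a second physical disk, which must already exist):

CREATE TABLESPACE cdm_index_space LOCATION 'D:/pgsql/indexes';

CREATE INDEX idx_visit_person_id
  ON imrd1903.visit_occurrence (person_id)
  TABLESPACE cdm_index_space;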


(Gregory Klebanov) #13

@Mark, I am not sure you are disagreeing with me :slight_smile: In your example, it sounds like Postgres was used for transactional use cases, which is what RDBMSs are built for. What Postgres is not good at is large-scale analytics on large datasets where you have to scan through enormous amounts of data. It was simply not built to be a data warehouse. Everything has its purpose and is optimized for that. I am sure Postgres is great for typical transactional use cases.

And yes, I am sure we can do a lot more to optimize our code too :slight_smile: not everything can be blamed on the RDBMS.


(Mark Seal) #14

It was a live engineering data warehouse (I understand that a live data warehouse is almost a contradiction in terms; I did not come up with that idea, I just had to make it a reality). Because some of the engineering analyses commonly rolled through 6 billion rows of a larger partitioned table, with joins afterwards, we were heavily read-optimized. It was only transactional in the sense that if a batch load failed, it would roll back to a safe state.

My comments are directed at the fact that the OP gave us some givens about what he has to work with, so while using something other than an RDBMS would be ideal, I think that with the correct optimization of Postgres he should be able to run the queries. I am agreeing with you that, if starting from scratch, your examples would be a better solution.


(Gregory Klebanov) #15

Very interesting, Mark. Maybe this is an opportunity for us to learn from your experience. I looked at the OMOP CDM and the partitioning types that were available before HASH was introduced, and at the time I didn’t see any obvious way to partition the data usefully. But would you be interested in thinking about that again together?


(Mark Seal) #16

Sure. I will have to do it after work hours but I am game.

