
Atlas Cohort Generation Optimization

We are in the process of ramping up our Atlas usage and want to increase the number of people running jobs on it.

However, we have run into a substantial bottleneck in the time it takes to run these jobs, mostly cohort generation: they take upwards of 10-20 hours, and some even longer.

We have the CDM data on a remote PostgreSQL database with about 9 million patients, which is where the cohort generation runs.

I was looking for a place to start investigating what is causing the long runtimes. Would having the WebAPI on the same server as the database decrease the time, or would it generally help to increase the resources on the server the WebAPI is currently deployed on?

Any help would be greatly appreciated and happy to share specs and any other relevant information.

Best,
Alex

No, putting the WebAPI on the same server as the database would not decrease the time, since network bandwidth is not the limiting factor here. It is extremely hard to diagnose/profile a database via a forum.

What I’d suggest is that you export the SQL for one of those long-running cohort generations from the Atlas UI and put it into a pgAdmin session. From there, you can execute the queries piece by piece and use the EXPLAIN command on a statement to analyze what the planner is doing with the SQL. You may find that it is not using indexes, or that work memory is spilling to disk and slowing down your IO. If you do EXPLAIN ANALYZE, it will actually execute the statement and report runtime statistics. The EXPLAIN documentation also covers additional parameters for reporting on BUFFERS, etc.
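To make that concrete, a session might look something like the sketch below; the cdm schema name and the concept IDs are just placeholders, not taken from your actual cohort definition. The idea is to wrap one exported statement in EXPLAIN (ANALYZE, BUFFERS) and read the plan it prints.

    -- Profile one statement from the exported cohort SQL in psql/pgAdmin.
    -- The schema name and concept IDs below are placeholders.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT co.person_id, co.condition_start_date
    FROM cdm.condition_occurrence co
    WHERE co.condition_concept_id IN (201826, 201254);  -- placeholder concept ids
    -- In the output, look for sequential scans on big tables, sorts that spill
    -- ("Sort Method: external merge  Disk: ..."), and low shared-buffer hit counts.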

Database profiling is a bit of an art…it depends on your infrastructure, hardware, server configuration, etc. It can be very worthwhile to pick up the skill because the better you are at it, the faster you can identify the bottlenecks in your environment. If you do spend the time investigating, I’d very much appreciate it if you reported your findings.

As @Chris_Knoll said, optimization can be pretty complex (and he gave you some good tips), yet sometimes you can start with some simple things.

I would start with a simple approach:

  1. Check that the OMOP CDM tables are indexed. Also make sure PostgreSQL is not configured to queue up job requests beyond a certain number, that there are enough worker processes, etc. (see the sketch after this list).

  2. PostgreSQL is an RDBMS and as such is not really optimized for analytics, especially on large data. Some people swear PostgreSQL can scale, but I think there are much better options available. You are running a pretty sizable database - 9 million patients. One option is to move to a more scalable solution - AWS Redshift, Google BigQuery, and Snowflake were created with analytics at scale in mind. If that is one of your options, I would seriously consider it before doing any more optimization.

  3. If #2 is not an option, you can easily check your server resources and see where your possible bottleneck is - CPU, RAM, disk, or all of the above.

  • Check the number of CPU cores and the load on each. If you are maxing out cores, then jobs will be fighting for CPU.
  • Check whether you have allocated enough RAM for buffers, sorting, etc. - there are plenty of optimization guides available.
  • It is possible that your RAM is maxed out and the system is swapping to disk. If the disk is a good old-fashioned HDD, then you are losing speed in general, and swapping would definitely degrade your performance significantly.
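To make the checks in item 1 concrete, here is a minimal sketch against the PostgreSQL catalogs; the cdm schema name and the table list are assumptions about your setup:

    -- List existing indexes on some of the largest CDM event tables.
    SELECT tablename, indexname, indexdef
    FROM pg_indexes
    WHERE schemaname = 'cdm'
      AND tablename IN ('condition_occurrence', 'drug_exposure', 'observation_period');

    -- Inspect connection and worker limits (these setting names exist on PostgreSQL 10+).
    SELECT name, setting
    FROM pg_settings
    WHERE name IN ('max_connections', 'max_worker_processes',
                   'max_parallel_workers', 'max_parallel_workers_per_gather');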

Very likely you will see the bottleneck is one of CPU, RAM, or disk. Then you have a choice: keep throwing hardware at it, or - if you do not want to waste your money and time - consider option 2 :slight_smile:

good luck!

Greg gives good guidance here, and I can say from my experience that CPU has not been the bottleneck: when dealing with large-scale data, it’s about how fast you can get the data from disk and how much of it you can hold in memory to work with.

I believe there are elements of EXPLAIN ANALYZE that will tell you whether your query resulted in cache hits/misses, used temp storage, etc. These are all indications that you may need to upsize your infrastructure, or configure it to give you more resources.
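If you want a broader view than a single query plan, the overall buffer-cache hit ratio is also exposed in pg_stat_database. A rough sketch, with a placeholder database name:

    -- Overall buffer-cache hit ratio for the CDM database.
    -- 'your_cdm_db' is a placeholder database name.
    SELECT datname,
           blks_hit,
           blks_read,
           round(blks_hit::numeric / NULLIF(blks_hit + blks_read, 0), 3) AS cache_hit_ratio
    FROM pg_stat_database
    WHERE datname = 'your_cdm_db';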


Thank you both for the opinions; they’re greatly appreciated.

As of now we don’t have the means to move into the cloud, but we have plenty of local resources to throw at the servers if we need to. Right now they are pretty beefy, we haven’t seen overuse of RAM or CPU, and everything is on SSD as well.

We will start with the EXPLAIN ANALYZE piece. We have been working to optimize PostgreSQL more generally in terms of buffers, etc., but we still have a long way to go considering we are starting with little to no knowledge in that regard.

I will be sure to keep you updated based on how things go.

Thanks again.

I understand fully. I had the same starting point back in my MS SQL days, and had to re-learn some things when we moved over to PDW…and now we’re on Redshift, and there’s a new set of things to discover.

I’m pretty sure you can get PG to perform on datasets < 1 TB. You just need to find the online resources that can help you evaluate your runtime profile and make the necessary adjustments. SSD is good (fast IO) and loads of RAM is good (keeps those joins in memory), but the default PG settings (if I recall correctly) are meant to keep the DB footprint small…and you want a big footprint to do some heavy lifting.
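As an illustration only - the right values depend entirely on your hardware and concurrency - the settings most people raise for analytic workloads can be changed with a sketch like this (shared_buffers needs a server restart to take effect):

    -- Illustrative starting points for a dedicated server with plenty of RAM;
    -- not recommendations for any specific machine.
    ALTER SYSTEM SET shared_buffers = '16GB';          -- default is typically 128MB
    ALTER SYSTEM SET effective_cache_size = '48GB';    -- planner hint about OS-level caching
    ALTER SYSTEM SET work_mem = '256MB';               -- per sort/hash node, so mind concurrency
    ALTER SYSTEM SET maintenance_work_mem = '2GB';     -- speeds up CREATE INDEX and VACUUM
    SELECT pg_reload_conf();                           -- applies the non-restart settings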

I’d read through this article as a starting point.
