Tips for configuring WebAPI for large number of end users (~200) and individuals (~500k) in a BigQuery database

Rasko_Leinonen · July 17, 2023, 8:18am

Dear OHDSI community,

Are there any tips for configuring WebAPI for a large number of end users (~200) and individuals (~500k) in a BigQuery database? For example, what parameter values would you recommend for:

spring.datasource.hikari.maximum-pool-size=
spring.datasource.hikari.minimum-idle=
spring.batch.taskExecutor.corePoolSize=
spring.batch.taskExecutor.maxPoolSize=

and would you have any other recommendations e.g. should we run multiple instances (Kubernetes pods) of the WebAPI service?

Best wishes,
Rasko Leinonen

Chris_Knoll · July 17, 2023, 1:46pm

You dont’ need to worry about these params because they only come into play in connecting to your authorization datasource (ie: you set up a database with user/password that is used to authorize).

These are used to control the maximum number of concurrent threads when running Atlas analysis queries (cohort generation, Incidence rates, Chacterization, vocabulary searches). I did a quick search and they talk about 100 queries per reservation which I’m not sure what that means. In database platforms like Microsoft Azure/PDW, or Redshift, there is a limit to 30-60 concurrent queries against a cluster, so this is where you’d want to specify the maxPoolSize to limit that.

Rasko_Leinonen · July 19, 2023, 9:59am

Thank you Chris,

Would you have any other recommendations e.g. are there other parameters that affect maximum parallelization?

Best wishes,
Rasko

Chris_Knoll · July 19, 2023, 1:19pm

I am not well versed in BigQuery, so I’m not sure what the mechanism for data distribution is. But for things like Redshift and PDW, one important consideration is your ‘distribution key’ that is used to divide the data across the cluster. A good distribution means data is evenly distributed across the nodes so that your distributed queries process similar volumes of data across nodes. At the same time, you want to distribute on the same key as your joins so that you don’t do cross-node joins.

What this means for a CDM context is that you’ll distribute your data on person_id: person IDs are going to be even distributed across patients (it’s a unique ID), it’s likely that patient-level data will be evenly distributed across all people (some people will have a lot, some people a little, but that distribution will be random across person_id), and most importantly: a given patient_id is linked to all domains of data via their person_id, so you’ll be joining on this column the most, so you want a persons health records all co-located on the same node of a cluster.

Beyond that, from a BigQuery perspective, I would recommend @tom.white.md and @Ajit_Londhe for their input because they have technical expertise.