OHDSI Home | Forums | Wiki | Github

Optimizing omop database for Achilles

As we know, Achilles provides descriptive statistics on an OMOP CDM database. However, since I started Achilles it has been nearly 2.5 weeks to cover up to Analysis 1325 (and still running). So, I wonder what optimizations as indexes, etc you used to make Achilles go faster. My Achilles cmd is:

achilles(connectionDetails,
cdmDatabaseSchema = “mdcr2003_2020”,
resultsDatabaseSchema = “atr2003_2020”,
scratchDatabaseSchema = “achilles_scratch”,
vocabDatabaseSchema = “omop_20220331”,
numThreads = 1,
sourceName = “src_mdcr2003_2020”,
cdmVersion = “5.3”,
runHeel = FALSE,
runCostAnalysis = FALSE,
tempAchillesPrefix = ‘ta’,
optimizeAtlasCache = TRUE,
createIndices = TRUE)

Thanks

A typical Achilles execution should take a few hours at the most. Could you describe your environment in terms or the following:

  • Server infrastructure: CPU / Memory / Disk size
  • CDM size (GB) and population size (number of people)

Could you also provide the log of your Achilles execution? You can send it to me directly (fdefalco at ohdsi dot org)

Thanks,
Frank

Server infrastructure:

Achilles runs in a docker container where:
CPU:

lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 1
On-line CPU(s) list: 0
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
Stepping: 7
CPU MHz: 2294.609
BogoMIPS: 4589.21
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 22528K
NUMA node0 CPU(s): 0
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_vnni md_clear flush_l1d arch_capabilities

Memory:
grep MemTotal /proc/meminfo
MemTotal: 16228440 kB

Data: PostgreSQL 10 with 20TB space

CDM size:
select count(*) from condition_occurrence;
count

2925310457
(1 row)

select count(*) from cost;
count

3203132083
(1 row)

select count(*) from death;
count

271317
(1 row)

select count(*) from device_exposure;
count

92277233
(1 row)

select count(*) from drug_era;
count

390797680
(1 row)

select count(*) from drug_exposure;
count

1184255727
(1 row)

select count(*) from measurement;
count

287929387
(1 row)

select count(*) from observation;
count

847634161
(1 row)

select count(*) from observation_period;
count

11327617
(1 row)

select count(*) from payer_plan_period;
count

466954242
(1 row)

select count(*) from person;
count

10063600
(1 row)

select count(*) from procedure_occurrence;
count

1304437549
(1 row)

select count(*) from provider;
count

2979750
(1 row)

select count(*) from visit_detail;
count

2080257491
(1 row)

select count(*) from visit_occurrence;
count

1511930287
(1 row)

For the data server I have:

grep MemTotal /proc/meminfo
MemTotal: 7927900 kB

And for CPU
lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
Stepping: 7
CPU MHz: 2294.609
BogoMIPS: 4589.21
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 22528K
NUMA node0 CPU(s): 0-3
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_vnni md_clear flush_l1d arch_capabilities
[jcabrerazuniga@hsc-ctsc-achilles ~]$

I need to ask: Where is the Achilles execution log? Also, should I get a better server configuration?

Thanks

Thanks for the infrastructure specifications. Achilles processing is almost entirely execution of SQL statements on the database so all processing bottlenecks are likely associated with the database layer.

From what I can tell, your database server would likely perform better with more memory (currently only 8GB) I would recommend 32 or 64.

The log file will be the most helpful debugging tool at this point. It can be found in the output folder, which by default should be ‘output’, a subdirectory of the working directory where you executed Achilles. The file is named log_achilles.txt.

log_achilles.pdf (64.8 KB)

I just attached the logs

Thanks
Jose

t