OHDSI Home | Forums | Wiki | Github

Hardware specs to run OHDSI technology stack

Hi all,

I am evaluating hardware setups for the CDM and OHDSI tools in our department. What I had in mind is a central server that hosts the CDM and vocabularies (read-only), with a writable scratch schema added there (as is done at JnJ). The client boxes would do the compute-intensive work, such as running Patient-Level Prediction studies.

Is there any advice from those who have created this kind of setup? I am aware of the JnJ setup from @Patrick_Ryan, but I am curious how others are doing this at their institutes.

  1. What specs are needed for the database server to get a reasonably fast response for, say, up to 5 concurrent users? I know the Parallel Data Warehouse used at JnJ works great, but what is the alternative with a lower price tag?

  2. We are currently on Postgres, but extractions are quite slow. Would switching to SQL Server be better, or is this only related to the hardware running the database?

  3. The client machines could probably just be high-end consumer boxes with 32 GB of memory and 1 TB SSD drives? Any other things to consider? A minimum number of cores?

  4. I am also very interested in adding GPU power to the stack, since we will be working on deep learning algorithms as of Dec 1st with a new PostDoc. @dsontag @razavian @msuchard, any suggestions for machines that can host NVIDIA GPUs? What are good GPU cards to buy?
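On question 2: before switching engines, it is worth checking whether Postgres is simply under-provisioned for memory. As a rough sketch (these are common community rules of thumb, not official recommendations, and `suggest_pg_settings` is a hypothetical helper name):

```python
# Rule-of-thumb Postgres memory settings for a dedicated database host:
# shared_buffers ~25% of RAM, effective_cache_size ~75% (a planner hint,
# not an allocation), and work_mem divided conservatively per connection,
# since each sort/hash node in a query can allocate its own work_mem.

def suggest_pg_settings(ram_gb: int, max_connections: int = 20) -> dict:
    """Return rule-of-thumb postgresql.conf values (illustrative only)."""
    shared_buffers_gb = ram_gb // 4
    effective_cache_gb = ram_gb * 3 // 4
    work_mem_mb = max(4, (ram_gb * 1024) // (max_connections * 8))
    return {
        "shared_buffers": f"{shared_buffers_gb}GB",
        "effective_cache_size": f"{effective_cache_gb}GB",
        "work_mem": f"{work_mem_mb}MB",
    }

# A 64 GB server shared by ~5 concurrent analysts:
print(suggest_pg_settings(64, max_connections=5))
```

Whether this closes the gap with SQL Server depends heavily on the extraction queries and indexing, but it is a cheap experiment before buying anything.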

All input is highly appreciated.

Thanks,

Peter

Hi Peter,

Have you considered looking at AWS as an option? It is probably not the cheapest option to start with, but there are multiple reasons why you might want to consider that platform and why it could be a good investment:

  1. You can easily use it to experiment with properly sizing the infrastructure for your needs, and you have many options - from EC2 instances to RDS, Redshift, etc. You can try different configurations to find which one works for you. You could then lock into a reserved instance to get much better pricing, or even re-create a similar configuration on-premises.

  2. You can spin up your infrastructure in a matter of minutes. You can also shut it down when you are not using it.

  3. AWS offers a very complete toolbox for the effective creation and use of the CDM - from options for CDM conversions and hosting the CDM and vocabularies, to big data processing on the analytics side.

  4. Scalability, elasticity, availability, choice of locations, pay-per-use, scripted infrastructure, and many other benefits.

But again, even if you decide to go with an on-premises solution, it could at least help you size your platform properly based on your real needs.

I am also happy to give you some more general on-premises setup options as well. What is the size of your CDM data?

Hi Peter,

UCDenver has gone with Google BigQuery for their OMOP data. We haven’t solved the issue of how it’s going to work with WebAPI or the other OHDSI tools, but apparently they determined that it is very cheap and fast compared to the alternatives. (And I’d be thrilled if we weren’t the only ones trying to get it to work as a supported platform :smile:)

Cheers,
Sigfried

Thanks Greg.

Yes, I am aware of the flexibility of the cloud solutions. I am actually trying out Azure SQL Server at the moment to see if it could be used to host the CDM alongside a connected virtual machine running Atlas. This would be a great solution for hosting a database we could share for method development, so I am certainly interested in AWS experience as well.

However, this is not really an option for our internal infrastructure and our database because of governance issues. We are not even allowed to have an internet connection to the database, which is quite annoying and unique among the many databases I have worked with. It will take some time before it is accepted that putting data in the cloud is not by definition less secure…

Our database contains approx. 2 million patients, so it is a small database, but we do have a lot of textual information in the notes field that we would like to start processing with text mining tools. I do not want any technical limitations to become the bottleneck.

So if you have ideas for on-premises solutions please let me know.

Hi Sigfried,

Thanks, I am not familiar with Google BigQuery but I will definitely have a look at it.

However, a cloud solution is not possible for us at the moment (see my reply to Greg).

Thanks,

Peter

Hi Peter,

This is our new server spec:

Server Type: DELL POWEREDGE R730XD
Processors: INTEL XEON E5-2667V4 3.2GHz x 2 CPUs
Memory: 768GB, 24 x 32GB PC4-17000P DDR4-2133 REGISTERED ECC MEMORY
Hard Drives:

  • 18 x DELL 1.6TB MLC SAS III SSD 2.5" ENTERPRISE CLASS 12Gb/s SSD
  • 2 x DELL 300GB 10K SAS 2.5" HDD

Raid Controller: PERC H730 1GB CACHE 12Gb/s RAID CONTROLLER
Management: iDRAC8 ENTERPRISE
Networking: Dell Intel 2 x 10GbE & 2 x 1GbE Rack Network Daughter Card - 99GTM
Power Supply: DUAL 1100W POWER SUPPLY
OS: Windows Server 2012 Standard
DB: MSSQL 2016 Standard

This server is awesome! A job that took one and a half hours on the previous server finished within 17 seconds on this one.
Disk I/O and DB indexing seem to be the key factors for the speedup. Multiple NVMe SSDs, or multiple SSDs in RAID 5 on a 12 Gb/s interface, can dramatically reduce the time spent on disk I/O.
For DB indexing, please refer to: Incremental Achilles?
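For a back-of-the-envelope sense of the numbers above (the `raid5_usable_tb` helper is mine, and real usable capacity also depends on the RAID layout and filesystem overhead):

```python
# Illustrative arithmetic for the server spec and speedup reported above.

def raid5_usable_tb(n_disks: int, disk_tb: float) -> float:
    """RAID 5 usable capacity: one disk's worth of space goes to parity."""
    return (n_disks - 1) * disk_tb

# 18 x 1.6TB SSDs in a single RAID 5 group:
print(round(raid5_usable_tb(18, 1.6), 1))  # 27.2 TB usable, 1.6 TB to parity

# Reported speedup: one and a half hours down to 17 seconds.
speedup = (1.5 * 3600) / 17
print(round(speedup))  # roughly 318x
```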

I purchased the server (refurbished) from eBay, and added more RAM and disks.

Supermicro has many GPU machines.
We recently purchased a SuperServer 4028GR-TR, which can hold up to 8 GPUs in a 4U (actually 5U) server: https://www.supermicro.com.tw/products/system/4U/4028/SYS-4028GR-TR.cfm

Another server we have used is the 7048GR-TR, which can hold up to 4 GPUs in a 4U server.
https://www.supermicro.com.tw/products/system/4U/7048/SYS-7048GR-TR.cfm
The NVIDIA Titan X (Pascal) seems to be one of the most reasonable GPUs considering price and performance.
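When picking a card for deep learning, GPU memory is often the first constraint. A rough lower-bound sketch (the `param_memory_gb` helper and the 100M-parameter example are mine; the Titan X Pascal has 12 GB of memory):

```python
# Lower-bound check of whether a network's weights fit in GPU memory.
# Training also needs room for gradients, optimizer state, and activations,
# so treat this as a floor, not a budget.

def param_memory_gb(n_params: int, bytes_per_param: int = 4) -> float:
    """Memory for the parameters alone, in GB (float32 by default)."""
    return n_params * bytes_per_param / 1024**3

# e.g. a hypothetical 100M-parameter network in float32:
print(round(param_memory_gb(100_000_000), 2))  # ~0.37 GB for weights alone
```

In practice, activation memory (which scales with batch size) usually dominates, which is why larger-memory cards remain attractive even for modest models.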
