We are now planning to convert the Korean national claims data into the CDM, covering 9 years of data for 50 million patients.
As a feasibility test, we have already converted 9 years of national claims data for 1 million patients into the CDM. From this, we can estimate the size of the final data for all 50 million patients:
The total record count will be about 120,000,000,000, and the DB file size will be about 15 TB. Roughly 1.6 TB of new data will be added to the DB each year.
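For reference, here is a minimal back-of-the-envelope sketch of how the 1M-patient feasibility run scales up to the 50M-patient estimate. The per-sample figures (`sample_records`, `sample_db_tb`) are assumptions chosen for illustration to be consistent with the totals quoted above, not measured values:

```python
# Back-of-the-envelope extrapolation from the 1M-patient feasibility
# conversion to the full 50M-patient national claims data.
# The sample figures below are illustrative assumptions, not measurements.

SAMPLE_PATIENTS = 1_000_000      # feasibility test cohort
FULL_PATIENTS = 50_000_000       # whole covered population
YEARS = 9                        # years of claims history

sample_records = 2_400_000_000   # assumed record count in the 1M-patient sample
sample_db_tb = 0.30              # assumed CDM DB size of the sample, in TB

scale = FULL_PATIENTS / SAMPLE_PATIENTS      # 50x

full_records = sample_records * scale        # ~120,000,000,000 records
full_db_tb = sample_db_tb * scale            # ~15 TB
annual_growth_tb = full_db_tb / YEARS        # ~1.6-1.7 TB per year

print(f"Estimated total records: {full_records:,.0f}")
print(f"Estimated DB size:       {full_db_tb:.1f} TB")
print(f"Estimated annual growth: {annual_growth_tb:.1f} TB/year")
```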
I wonder how we can handle data of this size if there are 20-30 concurrent Atlas users on the DB.
I recently heard about Microsoft SQL Server Parallel Data Warehouse (PDW). Could this platform be a solution for this workload?
How much budget will be required for the system?
The project is expected to start in 2018.
Hi @rwpark,
Sorry for the late response. We have been using PDW at JnJ for a while now and have found it to be very fast, even for extremely large data sets like yours. Especially for things like the CohortMethod and PatientLevelPrediction packages, where we perform extensive feature construction on the data, we see completion times drop from many hours to mere minutes when compared to standard SQL Server. We have made sure all OHDSI tools run smoothly on PDW.
However, I know nothing about the required budget. I do know it wasn’t cheap.
Cheers,
Martijn
Thank you for the reply.
We contacted Microsoft about PDW and have received a meeting date.
I heard that they provide it as a full package, including both hardware and software, as either a quarter rack (3 servers) or a half rack (6 servers).
If it progresses well, I will share the details with OHDSI members (in person), if they are interested.
You may also want to check out IBM Netezza and Teradata. They are hardware-accelerated platforms very similar to PDW. They all have their advantages and disadvantages. At the very least, you can use them to put pressure on the price if you decide to go with PDW.
Also, there is a whole working group using Hadoop. This is probably a cheaper solution. Look here: The new Working Group for Hadoop.
Hope this helps.