@lee_evans has the test data and can direct you to it
Hi Sean,
On the unm-improvements branch ( https://github.com/OHDSI/ETL-CMS/tree/unm-improvements ), we have been actively making fixes to the OHDSI ETL-CMS project, which is an ETL of the publicly available Medicare SynPUF simulated data ( https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs/DE_Syn_PUF.html ) (I think about 6-7 million patients).
We hope to commit the rest of our fixes in the next week or so, but we have already been able to successfully use this data with the various applications under the Olympus umbrella.
Christophe
@Christophe_Lambert – thanks very much for improving it. I believe we may have a small test set that you could use, if you need one. We were not able to implement it in the original version. @aguynamedryan has it somewhere (it may already be on github). But regardless of whether you implement that, thanks for your help with the ETL.
To clarify, by "test set" I mean a set of test data used to confirm that the data is going into the right table with the right information.
@Mark_Danese – glad we could help. One nice improvement that we already checked in is more detailed documentation on how to run the ETL. There were a handful of hand-converted patients that we were able to use as test data, and they were quite helpful. Between that and using the actual CMS data for testing, we are pretty close to having a complete ETL. We want to bang it against the Achilles Heel tool first, however.
Great, thanks! Looking forward to trying this out.
We uploaded the fully-implemented ETL this evening into the https://github.com/OHDSI/ETL-CMS/tree/unm-improvements branch. We are also preparing ready-to-load datasets so people can just get the data (and some smaller subsets) and not have to run the ETL themselves. @Patrick_Ryan is there a place to upload these on the OHDSI web site next week? It will be around 100GB uncompressed. @aguynamedryan, can we have your blessing to merge this branch with the master branch?
Thanks!
I don’t know if OHDSI has a place for holding and downloading large files. If OHDSI (or someone) has a Dropbox for Business or similar subscription, the files could be hosted there.
By the way, thanks SO MUCH for doing this. This is great.
You are most welcome. Everyone should do an OMOP ETL at least once – there is no better way to learn the ins and outs of the vocabulary. Also, in the main README.md file we tried to acknowledge those who made earlier contributions to the project – I’m sure we are missing people – can someone check and let me know of anyone to add? I’m not sure who @claire-oi is.
I think you have everyone. Claire is Claire Cangialose at Outcomes Insights.
At the risk of having to ask for forgiveness, I merged our branch into the master branch this evening. It is ready to go.
I talked to Ryan. It is completely fine.
That’s awesome @christophe_lambert, thanks! @lee_evans can help with storing a compressed version and posting it on ohdsi.org so that folks can download and play with it.
@Christophe_Lambert
@Christian_Reich has now set up his FTP server at ftp.ohdsi.org, so you can work directly with Christian to upload the files.
For convenience, we have uploaded the pre-processed ETL files to the OHDSI FTP site in the synpuf folder. Further instructions on what the files are and how to load them into an OMOP CDMv5 database can be found at the beginning of the ETL-CMS/python_etl/README.md file.
For some reason the FTP links are not rendered in the README.md GitHub markdown, so I’ve included them below:
The data can be retrieved from this FTP folder (ftp://ftp.ohdsi.org/synpuf). The file synpuf_1.zip (md5sum 0d11562053cec36999779cd5ae283c44) contains tables for the first twentieth of the data (116,362 patients) and might be suitable for smaller-scale testing. The remaining 19 .csv.gz files contain the table data for all 20 parts (2,326,856 patients). Here are the files and their md5sums:
care_site.csv.gz - 839c0df1f625bff74aba3fed07e4375f
condition_occurrence.csv.gz - fad02821bc7369385882b0fd403580e2
death.csv.gz - 3419aaa30fc9ebc7a605be7c5cf654fb
device_cost.csv.gz - 4a5587d391763072c988d5c264d44b69
device_exposure.csv.gz - b60d19898934d17f0bc08e3a260e83f7
drug_cost.csv.gz - 37901c540feef6b8a4179d0e18438dae
drug_exposure.csv.gz - bbd07537a247aad7f690f71bfeabd6a6
location.csv.gz - 40036fc2d6fe24378fd55158718e8a54
measurement_occurrence.csv.gz - bbd3c060b7ba2454f5bdd8cae589ca61
observation.csv.gz - 36b9525a151c95e9119c19dc96a94f5c
observation_period.csv.gz - 1cb344499f316b929aec4f117700511a
payer_plan_period.csv.gz - 55b81fab86dc088443e0189ba4b70fdb
person.csv.gz - 3ab936bb7da41c4bc9c0dddf9daac42c
procedure_cost.csv.gz - 5927a6509ef27e5f52c7ec1c3d86cbc9
procedure_occurrence.csv.gz - 1812775a95484646c1fd92d515e3b516
provider.csv.gz - 110c5fd05bc155eaa755e2e55ac7d0bf
specimen.csv.gz - 207057ec59a57edf7596b12d393b0f63
visit_cost.csv.gz - d48a8ab8155736d2a38c2feb7b82eb53
visit_occurrence.csv.gz - e1540783c7d44987cb1a7008da0e1fc0
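In case it saves anyone a step, below is a rough Python sketch of how the downloaded files could be checksum-verified and bulk-loaded into a PostgreSQL CDMv5 schema. This is not the project’s official loader: the "cdm" schema name, connection string, and CSV options are assumptions, so please treat python_etl/README.md as the authoritative instructions and adapt the COPY options (header, delimiter) to match the actual files.

# Rough sketch only: verify md5sums and bulk-load one table into a
# PostgreSQL CDMv5 schema. The "cdm" schema, the connection string, and the
# CSV options are assumptions; see python_etl/README.md for the real steps.
import gzip
import hashlib

import psycopg2  # assumes a PostgreSQL target; other DBMSs need their own bulk loader


def md5sum(path, chunk_size=1 << 20):
    """Compute the md5 of a file without reading it all into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_table(conn, table, path):
    """Stream a .csv.gz file into the matching CDM table via COPY."""
    copy_sql = (
        f"COPY cdm.{table} FROM STDIN "
        "WITH (FORMAT csv, DELIMITER ',', NULL '')"  # add HEADER true if the files carry one
    )
    with gzip.open(path, "rt") as f, conn.cursor() as cur:
        cur.copy_expert(copy_sql, f)
    conn.commit()


if __name__ == "__main__":
    # Check one file against the md5 published in the list above.
    expected = "<md5 from the list above>"
    actual = md5sum("person.csv.gz")
    print("person.csv.gz", "OK" if actual == expected else f"mismatch: {actual}")

    # Load it; repeat for the remaining tables.
    conn = psycopg2.connect("dbname=ohdsi user=ohdsi")  # placeholder connection settings
    load_table(conn, "person", "person.csv.gz")
    conn.close()

The same pattern applies to the rest of the tables; only the file name, the table name, and the connection settings should need changing.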
We hope this will serve as a useful resource for the community.
Christophe
A bug was found in the visit_occurrence table of the ETL by @sirpoovey and corrected, and I have uploaded new versions of visit_occurrence.csv.gz and synpuf_1.zip to the FTP site. If you have downloaded and loaded the data prior to this, you should only need to reload the visit_occurrence table.
Formerly the visit_concept_id for all visits was set to the concept for an inpatient visit (9201). Now visits from the inpatient source data have visit_concept_id set to 9201, visits from outpatient source data are set to 9202, and visits from carrier claims source data are set to 0, as we cannot distinguish between inpatient and outpatient visits for carrier claims data. We now retain versions of the ETL’d data within subdirectories at ftp://ftp.ohdsi.org/synpuf.
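For reference, the new assignment boils down to something like the following simplified sketch. It is illustrative only, not the actual ETL-CMS code, and the source-file labels are assumptions about how the claim types are tagged internally:

# Simplified sketch of the visit_concept_id rule described above; the
# source-file labels are illustrative, not the ETL's internal names.
INPATIENT_VISIT = 9201       # standard concept for an inpatient visit
OUTPATIENT_VISIT = 9202      # standard concept for an outpatient visit
NO_MATCHING_CONCEPT = 0      # carrier claims: care setting cannot be determined


def visit_concept_id(source_file):
    """Map a SynPUF claim source to the CDM visit_concept_id."""
    if source_file == "inpatient":
        return INPATIENT_VISIT
    if source_file == "outpatient":
        return OUTPATIENT_VISIT
    if source_file == "carrier":
        return NO_MATCHING_CONCEPT
    raise ValueError(f"unexpected source file: {source_file}")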
Someone was asking me today about more detail on the ETL-CMS code for the synpuf data, and I thought I’d provide a pointer to the OHDSI webcast I gave about it on July 5, 2016. Here is a link to the Webex recording: https://drive.google.com/file/d/0B3MHvw659x1kUUFyVVlLM0hRYzQ/view
To view this recording you will need to install the WebEx player for .ARF files, downloadable from here: https://www.webex.com/play-webex-recording.html
Not sure if this is the proper thread to reply to, but I’m having trouble running the CMS_SynPuf_ETL_CDM_v5.py script. I am trying to run it on the test data (DE_0). It does not generate any errors to the console, but it does not process any records either. It appears to read through all of the OMOP concept files and create the dictionary; however, at the end of the run, the script outputs the following:
CMS_ETL done
Input Records------
File: beneficiary , records_read= 0
File: carrier , records_read= 0
File: inpatient , records_read= 0
File: outpatient , records_read= 0
File: prescription , records_read= 0
Output Records------
** done **
If I open up the directory where the output .csv files are supposed to reside, all of the files are present, but only the header line has been inserted. There are no records.
I am running the script (version 1.0.1 from the GitHub master branch) using Cygwin on Windows 7.
Have you followed step 4 of the instructions to set up the .env file to specify the paths appropriately?
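In case it helps while you check that, here is a rough diagnostic sketch; the directory variable and file patterns are placeholders rather than the script’s actual configuration keys, so substitute whatever paths step 4 tells you to put in the .env file:

# Rough diagnostic sketch: confirm that the input path configured in the .env
# file (step 4 of the README) actually contains the DE_0 files. The variable
# name and patterns below are placeholders, not the script's real settings.
import glob
import os

SYNPUF_INPUT_DIR = r"C:\path\to\DE_0"  # paste in the path from your .env file

print("Input directory:", SYNPUF_INPUT_DIR)
print("Exists:", os.path.isdir(SYNPUF_INPUT_DIR))

# Zero records read with no errors often just means the configured path exists
# but holds no matching files, so list what is actually visible there.
for pattern in ("*.csv", "*.zip"):
    matches = glob.glob(os.path.join(SYNPUF_INPUT_DIR, "**", pattern), recursive=True)
    print(f"{pattern}: {len(matches)} file(s) found")
    for path in matches[:5]:
        print("  ", path)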
If you have truly found a bug, or think the instructions need to be clarified, the issues tracker of the ETL-CMS github page would be a good place to post.
Christophe