
Test CDM v5 dataset

@lee_evans has the test data and can direct you to it

Hi Sean,

On the unm-improvements branch, we have been actively making fixes to the
OHDSI ETL-CMS project (https://github.com/OHDSI/ETL-CMS/tree/unm-improvements),
which is an ETL of the publicly available Medicare SynPUF simulated data
(https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs/DE_Syn_PUF.html);
I think it covers about 6-7 million patients.

We hope to commit the rest of our fixes in the next week or so, but we have
been able to successfully use this data with the various applications under
the Olympus umbrella.

Christophe

@Christophe_Lambert – thanks very much for improving it. I believe we may have a small test set that you could use, if you need one. We were not able to implement it in the original version. @aguynamedryan has it somewhere (it may already be on github). But regardless of whether you implement that, thanks for your help with the ETL.

To clarify what I mean by test set: I mean a set of test data to confirm that the data is going to the right table with the right information.

@Mark_Danese – glad we could help. One nice improvement that we already checked in is some more detailed documentation on how to run the ETL. There were a handful of patients that were hand-converted that we were able to use as test data that were quite helpful. Between that and using the actual CMS data for testing, we are pretty close to having a complete ETL. We want to bang it against the Achilles Heel tool first, however.

Great thanks! Looking forward to trying this out.

We uploaded the fully-implemented ETL this evening into the https://github.com/OHDSI/ETL-CMS/tree/unm-improvements branch. We are also preparing ready-to-load datasets so people can just get the data (and some smaller subsets) and not have to run the ETL themselves. @Patrick_Ryan is there a place to upload these on the OHDSI web site next week? It will be around 100GB uncompressed. @aguynamedryan, can we have your blessing to merge this branch with the master branch?

Thanks!

I don’t know if OHDSI has a folder for holding and downloading large files. If OHDSI/someone has a dropbox for business or similar product subscription, it could be hosted there.

By the way, thanks SO MUCH for doing this. This is great.

You are most welcome. Everyone should do an OMOP ETL at least once – there is no better way to learn the ins and outs of the vocabulary. Also, in the main README.md file we tried to acknowledge those who made earlier contributions to the project – I’m sure we are missing people – can someone check and let me know of anyone to add? I’m not sure who @claire-oi is.

I think you have everyone. Claire is Claire Cangialose at Outcomes Insights.

At the risk of having to ask for forgiveness, I merged our branch into the master branch this evening. It is ready to go.

I talked to Ryan. It is completely fine.

That’s awesome @christophe_lambert, thanks! @lee_evans can help with
storing a compressed version and post it on ohdsi.org so that folks can
download and play with it.

@Christophe_Lambert
@Christian_Reich has now set up his FTP server at ftp.ohdsi.org, so you can work directly with Christian to upload the files.

For convenience, we have uploaded the pre-processed ETL files to the ohdsi FTP site in the synpuf folder. Further instructions on what the files are and how to load them into an OMOP CDMv5 database can be found at the beginning of the ETL-CMS/python_etl/README.md file.

For some reason the ftp links are not rendered in the README.md github markdown, so I’ve included them here below:

The data can be retrieved from this ftp folder. The file synpuf_1.zip (md5sum 0d11562053cec36999779cd5ae283c44) contains tables for the first 20th of the data (116,362 patients), and might be suitable for smaller-scale testing. The remaining 19 .csv.gz files represent the table data for all 20 parts (2,326,856 patients). Here are the direct links and md5sums for the files:
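For anyone checking a download against the md5sums given in this post, here is a minimal Python sketch. It is not part of the ETL-CMS code, and the function names `md5_of_file` and `matches_md5` are illustrative only.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that large archives do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected_hex):
    """Compare a file's MD5 digest with a published hex string,
    ignoring surrounding whitespace and letter case."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_md5("synpuf_1.zip", "0d11562053cec36999779cd5ae283c44")` should return `True` for an intact copy of the archive as originally posted, assuming the file sits in the current directory. Archives that are re-uploaded later will have a different checksum.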

We hope this will serve as a useful resource for the community.

Christophe

A bug was found in the visit_occurrence table of the ETL by @sirpoovey and corrected, and I have uploaded new versions of visit_occurrence.csv.gz and synpuf_1.zip to the FTP site. If you have downloaded and loaded the data prior to this, you should only need to reload the visit_occurrence table.

Formerly the visit_concept_id for all visits was set to the concept for an inpatient visit (9201). Now visits from the inpatient source data have visit_concept_id set to 9201, visits from outpatient source data are set to 9202, and visits from carrier claims source data are set to 0, as we cannot distinguish between inpatient and outpatient visits for carrier claims data. We now retain versions of the ETL’d data within subdirectories at ftp://ftp.ohdsi.org/synpuf.
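To illustrate the corrected rule, here is a minimal Python sketch of the mapping described above. It is not the actual ETL code: the function name `visit_concept_id` and the source labels "inpatient", "outpatient", and "carrier" are assumptions chosen for the example.

```python
# Concept IDs as described in this post:
# 9201 = inpatient visit, 9202 = outpatient visit,
# 0 = no concept assigned (carrier claims cannot be classified).
VISIT_CONCEPT_BY_SOURCE = {
    "inpatient": 9201,
    "outpatient": 9202,
    "carrier": 0,
}


def visit_concept_id(source_file):
    """Return the visit_concept_id for a claim, based on which
    source file it came from (hypothetical helper)."""
    try:
        return VISIT_CONCEPT_BY_SOURCE[source_file]
    except KeyError:
        raise ValueError("no visit mapping for source file: %r" % source_file)
```

Raising an error for an unknown source, rather than silently defaulting to 9201, is a design choice in this sketch; it would have made the earlier everything-is-inpatient behaviour harder to introduce by accident.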

Someone was asking me today about more detail on the ETL-CMS code for the synpuf data, and I thought I’d provide a pointer to the OHDSI webcast I gave about it on July 5, 2016. Here is a link to the Webex recording: https://drive.google.com/file/d/0B3MHvw659x1kUUFyVVlLM0hRYzQ/view

To view this recording, you need to install the WebEx player for .ARF files, downloadable from here: https://www.webex.com/play-webex-recording.html

Not sure if this is the proper thread to reply to, but I’m having trouble running the CMS_SynPuf_ETL_CDM_v5.py script. I am trying to run it on the test data (DE_0). It does not print any errors to the console, but it does not process any records, either. It appears to read through all of the OMOP concept files and create the dictionary; however, at the end of the script run, the following is printed:
CMS_ETL done
Input Records------
File: beneficiary , records_read= 0
File: carrier , records_read= 0
File: inpatient , records_read= 0
File: outpatient , records_read= 0
File: prescription , records_read= 0
Output Records------
** done **

If I open up the directory where the output .csv files are supposed to reside, all of the files are present, but only the header line has been inserted. There are no records.
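A quick way to confirm the "header only" symptom across every output file is to count the data rows in each one. The following is a minimal sketch, assuming the output files are plain .csv files with a single header line; the function names are illustrative and not part of the ETL.

```python
import csv
from pathlib import Path


def count_data_rows(csv_path):
    """Count the rows in a CSV file, excluding the header line."""
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header, if the file has one
        return sum(1 for _ in reader)


def summarize_output(directory):
    """Map each .csv file name in a directory to its data-row count."""
    return {
        path.name: count_data_rows(path)
        for path in sorted(Path(directory).glob("*.csv"))
    }
```

Calling `summarize_output` on the output directory returns a dictionary of file names and row counts; a count of 0 for every file matches the behaviour described here.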

I am running the script (version 1.0.1 from github/master) using cygwin on Windows 7.

Have you followed step 4 of the instructions to set up the .env file to specify the paths appropriately?
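One way to check this is to read the .env file and test whether each configured path exists. The sketch below makes two assumptions that should be checked against the README: that the file uses simple KEY=VALUE lines, and that every value is a directory path. It does not rely on any particular variable names, and the function names are illustrative.

```python
import os


def read_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    settings = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Drop optional surrounding quotes from the value.
            settings[key.strip()] = value.strip().strip("'\"")
    return settings


def missing_directories(settings):
    """Return the settings whose values are not existing directories
    (assumes every value is meant to be a directory path)."""
    return {k: v for k, v in settings.items() if not os.path.isdir(v)}
```

If `missing_directories(read_env_file(".env"))` returns a non-empty dictionary, those entries point at locations that the Python interpreter cannot see, which would be consistent with zero records being read.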

If you have truly found a bug, or think the instructions need to be clarified, the issues tracker of the ETL-CMS github page would be a good place to post.

Christophe
