
Test CDM v5 dataset

Hi all,

First, thanks for all the work that you’ve been doing over the years and the tooling that you’ve provided to the community. We’ve been looking through the work over the last week and are quite impressed.

We’re planning to add support for the CDM v5 and are looking for some sample data to use for testing. The only test datasets that I could find are the OSIM2 datasets based on the CDM v2 structure.

Is there an equivalent or similar set of test data available in CDM v5 format? Or any public set of test data that can be used to validate our CDM 5 implementation?

Thanks in advance,
Brandon

Brandon,

I’m sure that most of us are in the same boat. We need sample data sets with a known signal embedded in them for a variety of patient populations, disease frequencies and outcomes.

WhiteRabbit: I considered the following plan using WhiteRabbit’s fake data generation:

  1. Point WhiteRabbit at a CDM v5 DB containing data
  2. Add the CDM tables to the scan criteria
  3. Run the scan to generate the Excel output
  4. Manipulate the Excel: adjust the distribution of data, obscure info
  5. Import the edited Excel into WhiteRabbit
  6. Execute fake data generation to output CSV files that can be loaded into the CDM
  7. Load the data and perform the concept mapping

Issue: I fear that referential integrity in the fake data set will be poor. Fake data generation doesn’t care if it assigns a visit to a dead patient or gives cervical cancer to a male.
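For instance, a couple of quick checks along these lines would probably flag most of those violations. This is just a rough sketch using DatabaseConnector; the connection details and the cdm5 schema name are placeholders, and it assumes the vocabulary is loaded into the same schema:

```r
library(DatabaseConnector)

# Placeholder connection details - adjust for your environment
connectionDetails <- createConnectionDetails(dbms = "postgresql",
                                             server = "localhost/ohdsi",
                                             user = "ohdsi",
                                             password = "ohdsi")
conn <- connect(connectionDetails)

# Visits that start after the person's recorded death
querySql(conn, "
  SELECT COUNT(*) AS visits_after_death
  FROM cdm5.visit_occurrence v
  JOIN cdm5.death d ON d.person_id = v.person_id
  WHERE v.visit_start_date > d.death_date")

# Female-only conditions (e.g. cervical cancer) recorded for male persons
# 8507 is the standard OMOP concept ID for male gender
querySql(conn, "
  SELECT COUNT(*) AS female_only_conditions_in_males
  FROM cdm5.condition_occurrence co
  JOIN cdm5.person p ON p.person_id = co.person_id
  WHERE p.gender_concept_id = 8507
    AND co.condition_concept_id IN (
      SELECT concept_id FROM cdm5.concept
      WHERE LOWER(concept_name) LIKE '%malignant%cervix%')")

disconnect(conn)
```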

Public data?
I have been looking for public data sets that might serve in this space. I leveraged the public Cancer Genome Atlas data to load a CDM v4 instance as a cancer reference data set. The sparseness of the data and poor data quality became a problem, although I was eventually able to get much of it loaded and mapped, enough for Achilles to run and produce some visualizations. I have not yet updated this for v5.

Are there other public data sets that might suffice?

Bill

@Mark_Danese made a good suggestion on the OHDSI call today: we could try to use the CMS Medicare Synthetic dataset that they make available.

I read the documentation and downloaded the first part of the sample, and it looks like it could be quite useful to meet your needs. You could successfully populate many OMOP CDM (v4 or v5) tables: PERSON, OBSERVATION_PERIOD, CONDITION_OCCURRENCE, DRUG_EXPOSURE, DEVICE_EXPOSURE, PROCEDURE_OCCURRENCE, VISIT_OCCURRENCE, DRUG_COST, PROCEDURE_COST, PROVIDER, CARE_SITE (and then derive CONDITION_ERA and DRUG_ERA). The only tables that would be left unpopulated in CDM would be SPECIMEN, MEASUREMENT, NOTE, and OBSERVATION.
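As a rough illustration of the kind of mapping involved for PERSON, here is a sketch. The SynPUF column names and the synpuf.beneficiary_summary staging table are assumptions based on my reading of the beneficiary summary file layout, and the birth date is assumed to have been loaded as YYYYMMDD text, so treat this as an outline rather than a finished ETL step:

```r
library(DatabaseConnector)

connectionDetails <- createConnectionDetails(dbms = "postgresql",
                                             server = "localhost/ohdsi",
                                             user = "ohdsi",
                                             password = "ohdsi")
conn <- connect(connectionDetails)

# Assumes the raw beneficiary summary CSV was loaded as-is into
# synpuf.beneficiary_summary; column names may differ from your download
executeSql(conn, "
  INSERT INTO cdm5.person
      (person_id, gender_concept_id, year_of_birth, month_of_birth,
       day_of_birth, race_concept_id, ethnicity_concept_id, person_source_value)
  SELECT ROW_NUMBER() OVER (ORDER BY desynpuf_id)        AS person_id,
         CASE bene_sex_ident_cd WHEN '1' THEN 8507       -- male
                                WHEN '2' THEN 8532       -- female
                                ELSE 0 END               AS gender_concept_id,
         CAST(SUBSTR(bene_birth_dt, 1, 4) AS INT)        AS year_of_birth,
         CAST(SUBSTR(bene_birth_dt, 5, 2) AS INT)        AS month_of_birth,
         CAST(SUBSTR(bene_birth_dt, 7, 2) AS INT)        AS day_of_birth,
         0                                               AS race_concept_id,
         0                                               AS ethnicity_concept_id,
         desynpuf_id                                     AS person_source_value
  FROM synpuf.beneficiary_summary;")

disconnect(conn)
```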

Based on this raw data, I think you could apply all of the OHDSI tools:

0a) Download the CMS Synthetic data into an Oracle/Postgres/SQL Server database (files come as .csv)
0b) Instantiate an empty OMOP CDM schema and load the vocabulary

  1. Use WhiteRabbit to profile the source data
  2. Use RabbitInAHat to produce an ETL specification document that outlines the mapping from the source data to the OMOP CDM (v4 or v5)
  3. Write an ETL in whatever language you like best (since it’s basic claims data, the ETLs already publicly available on omop.org/CDM and through the OHDSI GitHub repo would be a very useful starting point)
  4. Use Usagi to map any local codes that aren’t already in the OMOP Vocabulary (offhand, I didn’t see any codes that require mapping, since conditions are in ICD9, drugs are in NDC, and procedures are in CPT4/HCPCS/ICD9)
  5. Run ACHILLES to profile the database and perform data quality assessment
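For step 5, here is a minimal sketch of what an ACHILLES run could look like from R. The connection details and schema names are placeholders, and the argument names may differ slightly between releases of the Achilles package, so check its documentation:

```r
library(Achilles)

# Placeholder connection details for the database that holds your CDM
connectionDetails <- DatabaseConnector::createConnectionDetails(
  dbms     = "postgresql",
  server   = "localhost/ohdsi",
  user     = "ohdsi",
  password = "ohdsi")

# Profile the CDM and write the summary tables into a results schema
achilles(connectionDetails,
         cdmDatabaseSchema     = "cdm5",
         resultsDatabaseSchema = "results",
         cdmVersion            = "5")
```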

@schuemie prepared a very useful summary of the process.

  6. Run the Treatment pathway code so you’ve got a working example of how to run an R script that connects to your database and produces standardized summary results (I would expect you to get 0 records back, because the sample data only covers 3 years, but at least you’ll have performed the proof-of-concept that the code can execute without error).
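If you want a smaller proof-of-concept before running the full study code, something along these lines exercises the same connect-and-query machinery the OHDSI study scripts rely on. This is only a sketch using DatabaseConnector and SqlRender; the cdm5 schema name and connection details are placeholders:

```r
library(DatabaseConnector)
library(SqlRender)

connectionDetails <- createConnectionDetails(dbms = "postgresql",
                                             server = "localhost/ohdsi",
                                             user = "ohdsi",
                                             password = "ohdsi")
conn <- connect(connectionDetails)

# Parameterized, dialect-neutral SQL, written the way the OHDSI scripts write it
sql <- "SELECT COUNT(*) AS person_count FROM @cdm_schema.person"
sql <- render(sql, cdm_schema = "cdm5")
sql <- translate(sql, targetDialect = "postgresql")

querySql(conn, sql)
disconnect(conn)
```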

Then, in short order, you’ll be able to apply other OHDSI analytical tools as they get released, for example:
a. the cohort definition WG is preparing the first release of CIRCE, which will allow parameterized SQL to select the population of interest based on your defined inclusion/exclusion criteria
b. the cohort characterization WG has started development of HERACLES, to perform descriptive statistics like what ACHILLES performs, but based on the selection of a cohort (such as what CIRCE would produce)
c. the population-level estimation WG has several analytical methods under development as R packages, including CohortMethod, SelfControlledCohort, SelfControlledCaseSeries, and ICTemporalPatternDiscovery.

This topic of getting access to simulated data to facilitate development has been a recurring theme for several folks. Is anyone interested in taking on this body of work for the community, now that @Mark_Danese has identified a good starting point for us to work from?

Cheers,

Patrick

Hi Patrick - I can help out here. It will be a great way for me to come up to speed with the changes the group has made to the model and tools since I last worked with them, and to start contributing going forward.

Don


Don,

I would love to hear any progress on this. I think there will be a lot of people eager to help out and provide input regarding this synthetic dataset creation process.

Jon

We did our original partial ETL on V4 for the synthetic Medicare data. Since then we have done a more comprehensive version for the Medicare data used in the SEER Medicare linked data (slightly different structure and variable names). I will see what we have in the way of documentation and code that we can share. Most of this was done in SAS. Let me get our documents together to use as a starting point.

We also did a quick v5 dataset from the synthetic Medicare data. I need to review our steps and dust it off (it was a very quick exercise to get some data). I can look at your code and notes, Mark, and come up with an up-to-date version of the conversion scripts.

We are also looking for a test data set. Is there any existing CDM v5 dataset based on the synthetic Medicare data?

Hi Patrick,
I have started to look into the ACHILLES analytical tool for profiling CDM v4.0 data. We have the CDM data as SAS datasets; can we use this tool directly, or do we need to convert them to SQL tables before using it? Thanks for your advice.
Bharat Thakkar
VA MedSafe, Hines, IL

Hi @vhapbmthakkb, you need to put your CDM data into a relational database system in order to take advantage of the OHDSI tools. OHDSI currently supports PostgreSQL (which is open-source and free), MS SQL Server, Oracle, and MS APS, and it also works on AWS Redshift. You may want to check in with @sduvall, @mmatheny, and @bcs about which RDBMS they are using in other parts of the VA.
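As one possible route from SAS into a relational database, here is a sketch using the haven and DBI/RPostgres R packages; the file path, schema, and connection details are hypothetical, and a native bulk loader or SAS/ACCESS engine would work just as well:

```r
library(haven)      # reads SAS .sas7bdat files
library(DBI)
library(RPostgres)

# Hypothetical path to one CDM table exported from SAS
person <- read_sas("cdm/person.sas7bdat")

con <- dbConnect(Postgres(),
                 host = "localhost", dbname = "ohdsi",
                 user = "ohdsi", password = "ohdsi")

# Write the data frame into the cdm schema; repeat for the other tables
dbWriteTable(con, Id(schema = "cdm", table = "person"), person, overwrite = TRUE)

dbDisconnect(con)
```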

Hi all,

I’m also interested in a test data set for CDM 5. Has any progress been made on this since the thread began?

Thanks!

@lee_evans has the test data and can direct you to it.

Hi Sean,

On the unm-improvements branch (https://github.com/OHDSI/ETL-CMS/tree/unm-improvements), we have been actively making fixes to the OHDSI ETL-CMS project, which is an ETL of the publicly available Medicare SynPUF simulated data (https://www.cms.gov/Research-Statistics-Data-and-Systems/Downloadable-Public-Use-Files/SynPUFs/DE_Syn_PUF.html), covering, I think, about 6-7 million patients.

We hope to commit the rest of our fixes in the next week or so, but we have been able to successfully use this data with the various applications under the Olympus umbrella.

Christophe

@Christophe_Lambert – thanks very much for improving it. I believe we may have a small test set that you could use, if you need one. We were not able to implement it in the original version. @aguynamedryan has it somewhere (it may already be on github). But regardless of whether you implement that, thanks for your help with the ETL.

To clarify what I mean by a test set: I mean a set of test data to confirm that the data is going into the right table with the right information.

@Mark_Danese – glad we could help. One nice improvement that we already checked in is more detailed documentation on how to run the ETL. There were a handful of hand-converted patients that we were able to use as test data, and they were quite helpful. Between that and using the actual CMS data for testing, we are pretty close to having a complete ETL. We want to bang it against the Achilles Heel tool first, however.

Great thanks! Looking forward to trying this out.

We uploaded the fully implemented ETL this evening to the https://github.com/OHDSI/ETL-CMS/tree/unm-improvements branch. We are also preparing ready-to-load datasets (and some smaller subsets) so people can just get the data and not have to run the ETL themselves. @Patrick_Ryan, is there a place to upload these on the OHDSI web site next week? It will be around 100GB uncompressed. @aguynamedryan, can we have your blessing to merge this branch into the master branch?

Thanks!

I don’t know if OHDSI has a folder for holding and downloading large files. If OHDSI (or someone) has a Dropbox for Business or similar subscription, it could be hosted there.

By the way, thanks SO MUCH for doing this. This is great.
