@Mark_Danese made a good suggestion on the OHDSI call today: we could try to use the CMS Medicare Synthetic dataset that they make available.
I read the documentation and downloaded the first part of the sample, and it looks like it could be quite useful for your needs. You could populate many OMOP CDM (v4 or v5) tables: PERSON, OBSERVATION_PERIOD, CONDITION_OCCURRENCE, DRUG_EXPOSURE, DEVICE_EXPOSURE, PROCEDURE_OCCURRENCE, VISIT_OCCURRENCE, DRUG_COST, PROCEDURE_COST, PROVIDER, CARE_SITE (and then derive CONDITION_ERA and DRUG_ERA). The only CDM tables that would be left unpopulated are SPECIMEN, MEASUREMENT, NOTE, and OBSERVATION.
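To make the "derive CONDITION_ERA" part concrete, here is a rough Python sketch of the idea: occurrences of the same condition for the same person are merged into one era whenever the next one starts within a persistence window of the previous end. The 30-day window, the tuple layout, and the function name `build_condition_eras` are my assumptions for illustration, not something taken from the CMS documentation or an official OHDSI script, and the concept ID 100 is a made-up placeholder.

```python
from datetime import date, timedelta
from itertools import groupby


def build_condition_eras(occurrences, gap_days=30):
    """Collapse (person_id, concept_id, start_date, end_date) rows into eras.

    Two occurrences land in the same era when the later one starts no more
    than `gap_days` after the running era end (assumed persistence window).
    A missing end_date is treated as equal to the start_date.
    """
    eras = []
    # Sort so each person/concept group is contiguous and ordered by start date.
    rows = sorted(occurrences, key=lambda r: (r[0], r[1], r[2]))
    for (person_id, concept_id), group in groupby(rows, key=lambda r: (r[0], r[1])):
        era_start = era_end = None
        count = 0
        for _, _, start, end in group:
            end = end or start
            if era_start is None:
                era_start, era_end, count = start, end, 1
            elif start <= era_end + timedelta(days=gap_days):
                # Close enough to the current era: extend it.
                era_end = max(era_end, end)
                count += 1
            else:
                # Gap too large: close the current era and start a new one.
                eras.append((person_id, concept_id, era_start, era_end, count))
                era_start, era_end, count = start, end, 1
        eras.append((person_id, concept_id, era_start, era_end, count))
    return eras


# Made-up rows; concept_id 100 is a placeholder, not a real OMOP concept.
occurrences = [
    (1, 100, date(2008, 1, 1), date(2008, 1, 5)),
    (1, 100, date(2008, 1, 20), None),
    (1, 100, date(2008, 6, 1), date(2008, 6, 2)),
    (2, 100, date(2009, 3, 1), None),
]
eras = build_condition_eras(occurrences)
```

In a real ETL this would more likely be a SQL statement run against CONDITION_OCCURRENCE, but the merging logic is the same.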
Based on this raw data, I think you could apply all of the OHDSI tools:
0a) Download the CMS Synthetic data into an Oracle/Postgres/SQL Server database (files come as .csv)
0b) Instantiate an empty OMOP CDM schema and load the vocabulary
- Use WhiteRabbit to profile the source data
- Use RabbitInAHat to produce an ETL specification document that outlines the mapping from the source data to the OMOP CDM (v4 or v5)
- Write an ETL in whatever language you like best (since it’s basic claims data, the ETLs already publicly available on omop.org/CDM and through the OHDSI GitHub repo would be a very useful starting point)
- Use Usagi to map any local codes that aren’t already in the OMOP Vocabulary (offhand, I didn’t see any codes that require mapping, since conditions are in ICD9, drugs are in NDC, and procedures are in CPT4/HCPCS/ICD9)
- Run ACHILLES to profile the database and perform data quality assessment
@schuemie prepared a very useful summary of the process.
- Run the Treatment pathway code so you’ve got a working example of how to run an R script that connects to your database and produces standardized summary results (I would expect you to get 0 records back, because the sample data covers only 3 years, but at least you’ll have performed the proof of concept that the code can execute without error).
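As a small illustration of what the ETL step could look like, here is a minimal Python sketch that maps rows of a beneficiary .csv file to PERSON records. Everything specific in it is an assumption to check against the CMS codebook and the OMOP Vocabulary before relying on it: the source column names (`DESYNPUF_ID`, `BENE_BIRTH_DT`, `BENE_SEX_IDENT_CD`), the YYYYMMDD date format, the sex codes 1/2, and the gender concept IDs 8507 (male) and 8532 (female). The sample rows are made up, and a real ETL would write to the database rather than return a list.

```python
import csv
import io

# Assumed mapping from the source sex code to OMOP gender concept IDs.
GENDER_CONCEPTS = {"1": 8507, "2": 8532}


def beneficiary_to_person(row, person_id):
    """Map one source beneficiary row (a dict) to a PERSON-like record."""
    birth = row["BENE_BIRTH_DT"]  # assumed to be formatted as YYYYMMDD
    return {
        "person_id": person_id,
        "person_source_value": row["DESYNPUF_ID"],
        # Unmapped codes fall back to concept 0, the usual "no match" value.
        "gender_concept_id": GENDER_CONCEPTS.get(row["BENE_SEX_IDENT_CD"], 0),
        "gender_source_value": row["BENE_SEX_IDENT_CD"],
        "year_of_birth": int(birth[0:4]),
        "month_of_birth": int(birth[4:6]),
        "day_of_birth": int(birth[6:8]),
    }


def load_persons(csv_file):
    """Read a beneficiary CSV and assign sequential person_ids from 1."""
    reader = csv.DictReader(csv_file)
    return [beneficiary_to_person(row, i) for i, row in enumerate(reader, start=1)]


# Made-up sample rows standing in for the downloaded file.
SAMPLE = """DESYNPUF_ID,BENE_BIRTH_DT,BENE_SEX_IDENT_CD
0001ABC,19230501,1
0002DEF,19430101,2
"""
persons = load_persons(io.StringIO(SAMPLE))
```

The same pattern (one small mapping function per target table) extends to the claims files, although those need vocabulary lookups rather than a hard-coded dictionary.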
Then, in short order, you’ll be able to apply other OHDSI analytical tools as they get released, for example:
a. the cohort definition WG is preparing the first release of CIRCE, which will allow parameterized SQL to select the population of interest based on your defined inclusion/exclusion criteria
b. the cohort characterization WG has started development in HERACLES, to perform descriptive statistics like what ACHILLES performs, based on the selection of a cohort (such as what CIRCE would produce)
c. the population-level estimation WG has several analytical methods under development as R packages, including CohortMethod, SelfControlledCohort, SelfControlledCaseSeries, ICTemporalPatternDiscovery.
This topic of getting access to simulated data to facilitate development has been a recurring theme for several folks. Is anyone interested in taking on this body of work for the community now that @Mark_Danese has identified a good starting point for us to work from?
Cheers,
Patrick