Standard CDM database for testing / demonstrating

(Martijn Schuemie) #1

(Credit for this idea goes to @Patrick_Ryan)

In R we have the iris and cars datasets that are used in many R examples. It would be very helpful to have something similar for OHDSI that could be used in the examples of our various R packages, and in The Book of OHDSI. The dataset should adhere to the CDM and be small, but still be able to demonstrate various complex functions, such as running CohortMethod and PatientLevelPrediction.

The steps to create such a dataset could be:

  1. Start with an existing simulated dataset, such as SynPUF or Synthea. Alternatively, we could take a real dataset and perturb it beyond the ability to reconstruct the original data.

  2. Filter to a subpopulation to reduce size while maintaining the ability to run the various use cases. Since CohortMethod is probably the most restrictive, we could restrict the population to those needed for the CohortMethod vignette (new users of diclofenac or celecoxib).

  3. Filter to a subset of concepts to reduce size even further. This dataset will need a copy of the vocab tables, which are relatively large if unfiltered. If we filter everything to, say, the 100 most prevalent concepts in each domain in the dataset after step 2, we will have reduced the size of the embedded vocabulary by a huge fraction, and the size of the dataset by a fair chunk.

  4. Embed this CDM database (including filtered vocab) in an R package.

Ideally, the end result would just be a few megabytes.
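
Step 3 could be sketched in a few lines of base R. This is only an illustration under stated assumptions: the toy tables, the helper name keepTopConcepts, and the choice to measure prevalence as the number of distinct persons per concept are all hypothetical, and n is set to 2 rather than 100 to keep the example small.

```r
# Toy stand-ins for a CDM domain table and the concept table (made-up values)
conditionOccurrence <- data.frame(
  person_id = c(1, 1, 2, 2, 3, 3, 4, 5),
  condition_concept_id = c(10, 20, 10, 30, 10, 20, 40, 10)
)
concept <- data.frame(
  concept_id = c(10, 20, 30, 40),
  concept_name = c("Concept A", "Concept B", "Concept C", "Concept D")
)

# Hypothetical helper: keep only rows whose concept is among the n concepts
# observed in the largest number of distinct persons
keepTopConcepts <- function(data, conceptColumn, n) {
  personConcept <- unique(data[, c("person_id", conceptColumn)])
  counts <- sort(table(personConcept[[conceptColumn]]), decreasing = TRUE)
  topConcepts <- as.numeric(names(counts)[seq_len(min(n, length(counts)))])
  data[data[[conceptColumn]] %in% topConcepts, , drop = FALSE]
}

filteredConditions <- keepTopConcepts(conditionOccurrence, "condition_concept_id", n = 2)

# Shrink the embedded vocabulary to the concepts that survived the filter
filteredConcept <- concept[concept$concept_id %in% filteredConditions$condition_concept_id, ]
```

The same filter would have to be applied per domain table, and the remaining vocabulary tables would need to be cut down consistently with the concept table.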

For step 4, one idea would be to use the RSQLite package, and store the dataset inside our dataset package in a SQL-queryable database. In theory, this code could work without installing any other software, and without having access to a database server:

library(DatabaseConnector)
connection <- connect(dbms = "sqlite", server = "OhdsiDataset")
querySql(connection, "SELECT COUNT(*) FROM person;")
disconnect(connection)
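
To make the RSQLite idea concrete, here is a minimal sketch that goes through DBI directly. It assumes the DBI and RSQLite packages are installed. The file name and the three-row person table are hypothetical, and only a small subset of the CDM person columns is shown.

```r
library(DBI)

# Toy person table; a small, illustrative subset of the CDM person columns
person <- data.frame(
  person_id = 1:3,
  gender_concept_id = c(8507, 8532, 8532),
  year_of_birth = c(1950, 1962, 1971)
)

# Hypothetical file name; a dataset package would ship such a file with its data
databaseFile <- file.path(tempdir(), "ohdsi-dataset-sketch.sqlite")

# SQLite is embedded in RSQLite, so no database server is involved
connection <- dbConnect(RSQLite::SQLite(), databaseFile)
dbWriteTable(connection, "person", person, overwrite = TRUE)
personCount <- dbGetQuery(connection, "SELECT COUNT(*) AS person_count FROM person;")
dbDisconnect(connection)

# Size on disk, relevant to the goal of keeping the package small
databaseSizeBytes <- file.size(databaseFile)
```

Because the database is a single file, it can be copied out of the package to a temporary location before use, so examples never modify the shipped copy.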

Let me know what you think of this idea. What would be a good starting point (step 1)?

Also, if we agree this is a good idea, I could really use some help (meaning someone willing to write and test steps 1-3 in R, so I can focus on 4).

(Seng Chan You) #2

@schuemie Of course, I love this idea. This is such an exciting idea! :grin:
Although we have converted a Synthea database to the OMOP CDM, I don’t think we can find a suitable patient set, such as new users of diclofenac or celecoxib, in Synthea. The granularity of conditions in Synthea is too coarse. Also, we cannot fully populate the essential visit and procedure tables using Synthea. So the SynPUF DB would be better.

(Martijn Schuemie) #3

Does anyone have a cool greek name for such a data resource?

(Christian Reich) #4

Demeter - Goddess of agriculture. Takes care of soil, crops, harvesting.

(Roger Carlson) #5


Hygieia: “Goddess of good health, cleanliness, and sanitation. This is where the word “hygiene” comes from.”

It could represent a “clean” dataset.

(Christian Reich) #6

We may need that one for the THEMIS police (= All data quality queries, Achilles Heel, THEMIS checks put together).

(Roger Carlson) #7

Or Eunomia, a minor Greek goddess of law and legislation (her name can be translated as “good order” or “governance according to good laws”). She is by most accounts the daughter of Themis and Zeus.

(Christian Reich) #8

Nice idea.

(Martijn Schuemie) #9

Ok, Eunomia it is! Thanks!

(Frank DeFalco) #10

We have recently completed an ETL from Synthea to CDM 5.3, which can now be found here:

I am not sure that it will fill all of our needs but it provides a way to bridge the capabilities of the Synthea simulator and our OHDSI tool stack. We plan on creating a number of data sets using Synthea and using them for training data sets for other activities. This will also be used for the upcoming tutorials at the EU symposium.

My understanding of the Synthea simulator is that it allows us to develop modules that can be used to simulate a variety of sequences of care. These modules appear to support robust use cases and allow us to specify conditions in SNOMED and drugs in RxNorm, falling nicely in line with our standards. Here is an example of one of their pre-defined modules, of which there are several dozen. Would it be useful if we created a module for diclofenac and celecoxib to test Synthea’s ability to generate meaningful data for our purposes?

(Martijn Schuemie) #11

Hi @Frank. I like the idea of using Synthea for this purpose, but could use your help. Here are the requirements I have for Eunomia:

  1. Be able to run some CohortMethod example (for example the diclofenac - celecoxib - GI bleed study in the vignette)

  2. Be able to run a PLP example (e.g. diclofenac users - GI bleed?)

  3. Be able to run ACHILLES

  4. Be able to demonstrate various other CDM properties, such as searching by source codes and source concepts.

  5. When loaded into a SQLite database, the whole CDM including vocab should ideally be less than 7MB (zipped), or at least less than 15MB (zipped)

The most demanding is number 1. We need enough people in the target (T), comparator (C), and outcome (O) cohorts to fit the propensity and outcome models. Moreover, propensity score matching / stratification only works if the covariates are not sampled independently.
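
One way to read that last point: if a simulator draws each covariate independently of who gets which drug, a propensity model has nothing to learn, and matching or stratifying on the score does nothing useful. A minimal base R simulation of that reading follows. The sample size, covariate prevalence, and effect size are arbitrary assumptions chosen only for illustration.

```r
set.seed(42)
n <- 5000

# Case 1: covariate drawn independently of treatment assignment
independentCovariate <- rbinom(n, 1, 0.3)
treatmentA <- rbinom(n, 1, 0.5)
independentFit <- glm(treatmentA ~ independentCovariate, family = binomial())

# Case 2: covariate influences treatment choice (log odds ratio of 1.5, assumed)
dependentCovariate <- rbinom(n, 1, 0.3)
treatmentB <- rbinom(n, 1, plogis(-0.5 + 1.5 * dependentCovariate))
dependentFit <- glm(treatmentB ~ dependentCovariate, family = binomial())

# Fitted log odds ratios of the covariate in each propensity model
independentEffect <- unname(coef(independentFit)["independentCovariate"])
dependentEffect <- unname(coef(dependentFit)["dependentCovariate"])

# Spread of the fitted propensity scores in each case
independentSpread <- sd(fitted(independentFit))
dependentSpread <- sd(fitted(dependentFit))
```

In the first case the fitted coefficient should sit near zero and the propensity scores should barely vary. In the second case the model recovers a clear effect, which is the kind of structure a simulated dataset would need for the CohortMethod example to be meaningful.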

Could you help set up Synthea in a way that produces a dataset meeting these requirements?