
Standard CDM database for testing / demonstrating

(Credit for this idea goes to @Patrick_Ryan)

In R we have the iris and cars datasets that are used in many R examples. It would be very helpful to have something similar for OHDSI that could be used in the examples of our various R packages, and in The Book of OHDSI. The dataset should adhere to the CDM and be small, yet still be able to demonstrate various complex functions, such as running CohortMethod and PatientLevelPrediction.

The steps to create such a dataset could be:

  1. Start with an existing simulated dataset, such as Synpuf or Synthea. Alternatively, we could take a real dataset and perturb it beyond the ability to reconstruct the original data.

  2. Filter to a subpopulation to reduce size while maintaining the ability to run the various use cases. Since CohortMethod is probably the most restrictive, we could restrict the population to those needed for the CohortMethod vignette (new users of diclofenac or celecoxib).

  3. Filter to a subset of concepts to reduce size even further. This dataset will need a copy of the vocab tables, which are relatively large if unfiltered. If we filter everything to say the 100 most prevalent concepts in each domain in the dataset after step 2, we will have reduced the size of the embedded vocabulary by a huge fraction, and the size of the dataset by a fair chunk.

  4. Embed this CDM database (including filtered vocab) in an R package.

Ideally, the end result would just be a few megabytes.
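To make step 3 concrete, here is a minimal sketch in Python (standard `sqlite3` module) of what "keep the N most prevalent concepts in each domain" could look like once the CDM sits in a SQLite file. The tiny in-memory tables and `TOP_N = 2` are stand-ins for illustration only; the real dataset would use the full CDM tables and N = 100.

```python
import sqlite3

TOP_N = 2  # illustrative; the proposal above uses the 100 most prevalent per domain

con = sqlite3.connect(":memory:")  # stand-in for the CDM SQLite file
con.executescript("""
CREATE TABLE concept (concept_id INT, domain_id TEXT);
CREATE TABLE condition_occurrence (condition_concept_id INT);
INSERT INTO concept VALUES (1, 'Condition'), (2, 'Condition'), (3, 'Condition');
INSERT INTO condition_occurrence VALUES (1), (1), (2);  -- concept 3 is never used
""")

# Rank concepts by prevalence within each domain, then drop everything
# outside the top N. Only the condition domain is shown here; the real
# script would repeat this per domain table.
con.execute(f"""
DELETE FROM concept WHERE concept_id NOT IN (
  SELECT concept_id FROM (
    SELECT concept_id,
           ROW_NUMBER() OVER (PARTITION BY domain_id ORDER BY n DESC) AS rn
    FROM (
      SELECT c.concept_id, c.domain_id,
             COUNT(co.condition_concept_id) AS n
      FROM concept c
      LEFT JOIN condition_occurrence co
        ON co.condition_concept_id = c.concept_id
      GROUP BY c.concept_id, c.domain_id
    )
  )
  WHERE rn <= {TOP_N}
)
""")
```

The rarely used concept (3) is removed, while the two prevalent ones survive, which is the size reduction step 3 is after.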

For step 4, one idea would be to use the RSQLite package, and store the dataset inside our dataset package in a SQL-queryable database. In theory, this code could work without installing any other software, and without having access to a database server:

connection <- connect(dbms = "sqlite", server = "OhdsiDataset")
querySql(connection, "SELECT COUNT(*) FROM person;")

Let me know what you think of this idea. What would be a good starting point (step 1)?

Also, if we agree this is a good idea, I could really use some help (meaning someone willing to write and test steps 1-3 in R, so I can focus on 4).


@schuemie Of course, I love this idea. It's such an exciting idea! :grin:
Although we converted a Synthea database into the OMOP CDM, I don’t think we can find a suitable patient set, such as diclofenac or celecoxib users, in Synthea. The granularity of conditions in Synthea is too vague, and we cannot fully populate the essential tables related to visits and procedures using Synthea. So, the SynPUF DB would be better.


Does anyone have a cool greek name for such a data resource?

Demeter - Goddess of agriculture. Takes care of soil, crops, harvesting.


“Goddess of good health, cleanliness, and sanitation. This is where the word “hygiene” comes from.”

It could represent a “clean” dataset.

We may need that one for the THEMIS police (= all the data quality queries, Achilles Heel, and THEMIS checks put together).

Or Eunomia, the minor Greek goddess of law and legislation (her name can be translated as “good order” or “governance according to good laws”). She is by most accounts the daughter of [Themis] and [Zeus].

Nice idea.

Ok, Eunomia it is! Thanks!


We have recently completed an ETL for Synthea to CDM 5.3, which can now be found here:

I am not sure that it will fill all of our needs but it provides a way to bridge the capabilities of the Synthea simulator and our OHDSI tool stack. We plan on creating a number of data sets using Synthea and using them for training data sets for other activities. This will also be used for the upcoming tutorials at the EU symposium.

My understanding of the Synthea simulator is that it allows us to develop modules that can be used to simulate a variety of sequences of care. These modules appear to support robust use cases and allow us to specify conditions in SNOMED and drugs in RxNorm, falling nicely in line with our standards. Here is an example of one of their pre-defined modules, of which there are several dozen. Would it be useful if we created a module for diclofenac and celecoxib to test Synthea’s ability to generate meaningful data for our purposes?


Hi @Frank. I like the idea of using Synthea for this purpose, but could use your help. Here are the requirements I have for Eunomia:

  1. Be able to run some CohortMethod example (for example the diclofenac - celecoxib - GI bleed study in the vignette)

  2. Be able to run a PLP example (e.g. diclofenac users - GI bleed?)

  3. Be able to run ACHILLES

  4. Be able to demonstrate various other CDM properties, such as searching by source codes and source concepts.

  5. When loaded into a SQLite database, the whole CDM including vocab should ideally be less than 7MB (zipped), or at least less than 15MB (zipped)

The most demanding is number 1. We need enough people in T, C, and O to fit the propensity and outcome models. Moreover, propensity score matching / stratification only works if the covariates are not sampled independently.
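To illustrate why that last point matters, here is a hypothetical simulation sketch (not OHDSI code; the coefficients and the celecoxib story are invented for illustration) in which a covariate drives treatment assignment, so a propensity model fitted on the data would have real signal:

```python
import math
import random

random.seed(1)

def simulate_patient():
    # Covariates drive treatment assignment: older patients with a
    # (hypothetical) GI-risk history are more likely to get celecoxib.
    age = random.gauss(60, 12)
    gi_history = random.random() < 0.2
    logit = -3.0 + 0.04 * age + 1.5 * gi_history  # invented coefficients
    p_treated = 1.0 / (1.0 + math.exp(-logit))
    treated = random.random() < p_treated
    return age, gi_history, treated

patients = [simulate_patient() for _ in range(2000)]
treated_ages = [age for age, _, t in patients if t]
comparator_ages = [age for age, _, t in patients if not t]
# Because age and treatment are correlated, propensity score matching /
# stratification actually changes the covariate balance. If covariates
# were sampled independently of treatment, the PS would be flat and
# matching would demonstrate nothing.
```

The point is only that whatever simulator we pick must produce this kind of covariate-treatment correlation, not that these particular coefficients are realistic.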

Could you help set up Synthea in a way that produces a dataset meeting these requirements?

Hi @schuemie, and all. I am a PhD student at UC Berkeley in Statistics, and am hoping to become a new contributor to the OHDSI population-estimation team. In trying to catch myself up to speed, I have read through the vignettes but was searching for exactly this: a sample database that I could query and use to experiment hands-on with the code. To that end, I’d love to help build this existing simulated dataset if it has not yet been created.

Can we connect to talk about details? For instance, is this the SynPuf DB that is being referenced? https://drive.google.com/file/d/18EjMxyA6NsqBo9eed_Gab1ESHWPxJygz/view I’d also like to know what the schema of the final product is expected to look like. The vignettes, for instance, don’t seem to require a vocab table, but this seems to be something you’d like to be included, based on #3 and #4 of your original Feb 1 post.

Hi @olivia!

Yes, currently if you want to play with the OHDSI tools but don’t have data in the OMOP CDM yourself, your best bet is the Synpuf data you referenced. Note that the Synpuf files do not contain the vocabulary tables, but these are an essential part of the CDM. Even though the Methods Library vignette may not directly reference the vocab tables (although I think there are some joins to concept_ancestor in some SQL here and there), they are used extensively by the tools themselves. For example, FeatureExtraction, which is used by CohortMethod, uses the hierarchy in the vocab to create condition and drug groupings to build features. So in addition to the Synpuf files you will need to fetch the Vocab files from Athena. Upload the Synpuf and Vocab tables into a single schema, and you should have a fully functional instance of the Common Data Model.

@Frank is making good progress on creating a self-contained database using Synthea that would make this all a little easier, but it is not yet complete.


@schuemie Last year I talked with SNOMED, and they are OK with publicly releasing a very small, reduced version of their terminology, tailored to only show very specific things, with an appropriate README discussing that it is “fair use” of IHTSDO’s copyright.

So, if we did proceed with this very minimal set, we could perhaps have it focus only on the hypertension use case found in the Book of OHDSI. We could then find pruned SynPUF data that activated the various permutations we wish to show.

In addition to a showcase, it’d be useful for basic application regression testing, for at least basic screens and walk-through functionality.

@schuemie Do you have an initial “Eunomia” SQL database dump that is sharable? What I’d prefer to do is export the data and directly check it into a separate repository, one CSV file for each of the tables we have sample data for. Then, we could use it to populate PostgreSQL or any other OHDSI database. Further, these CSV files could then be updated over time to reflect new demonstration and testing needs without having to have a source database to pull from.
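As a sketch of that workflow, assuming one CSV per CDM table with a header row, loading the files into a SQLite database takes only the Python standard library (the function name and the dynamic-typing shortcut are illustrative, not a proposed API):

```python
import csv
import sqlite3
from pathlib import Path

def load_csv_tables(csv_dir, db_path):
    """Load every CSV in csv_dir into db_path, one table per file.

    Assumes the first row of each CSV holds the column names. Column
    types are left to SQLite's dynamic typing, and names are interpolated
    unquoted: fine for a hand-crafted dataset repo, not for untrusted input.
    """
    con = sqlite3.connect(db_path)
    for csv_file in Path(csv_dir).glob("*.csv"):
        with open(csv_file, newline="") as f:
            reader = csv.reader(f)
            header = next(reader)
            table = csv_file.stem  # e.g. person.csv -> table "person"
            cols = ", ".join(header)
            placeholders = ", ".join("?" for _ in header)
            con.execute(f"CREATE TABLE {table} ({cols})")
            con.executemany(
                f"INSERT INTO {table} ({cols}) VALUES ({placeholders})",
                reader,
            )
    con.commit()
    return con
```

The same CSV files could just as easily be `COPY`-ed into PostgreSQL, which is the portability argument for keeping CSV as the canonical form in the repository.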

Although I’m not a lawyer, there is a “fair use” argument for using a handful of terminology codes, so they could be included directly in the repository as well. IHTSDO has previously told me that they would be OK with this fair-use practice; they’d just prefer to review the exact codes used to ensure they agree that it’s fair use. Hence, getting the code set we want to use over to IHTSDO is pretty important.

Hello. In order to get traction on understanding OHDSI SQL queries, I’ve created a tiny 10-patient subset database which I believe has a small enough vocabulary to be openly shared, yet actually produces results when cohort SQL queries are run. I’ve asked IHTSDO for their blessing, but have not yet received confirmation from them; I also hope to see if they’d permit a broader code set.

For now, I’ve named the project SynPUF-HCFU for “Hand Crafted, Fair Use”. I’d love to have any feedback.

To use the database, you need only download the “synpuf-10p.sql” file at the top of the repository. I’ve also copied over the following cohorts from the Atlas database, including the generated SQL for PostgreSQL, which is stored in the folder “cohort”.

 cohort_definition_id |                             cohort_definition_name                              
              1770673 | Angioedema events
              1770674 | Acute myocardial infarction events
              1770675 | New users of ACE inhibitors as first-line monotherapy for hypertension
              1770676 | New users of Thiazide-like diuretics as first-line monotherapy for hypertension

Then, I copied over de minimis data from SynPUF 5% for 10 individuals that met cohort definitions 1770674-1770676, bringing over the minimum set of records that permits these cohort queries to work successfully. Unfortunately, it seems that SynPUF 5% doesn’t have any hits for 1770673.

synpuf-10p=# select * from cohort;
 cohort_definition_id | subject_id | cohort_start_date | cohort_end_date 
              1770674 |      95538 | 2009-03-30        | 2009-04-06
              1770674 |     110862 | 2009-09-30        | 2009-10-07
              1770674 |      69985 | 2010-07-22        | 2010-07-29
              1770674 |       1780 | 2008-04-10        | 2008-04-17
              1770674 |     107680 | 2009-07-20        | 2009-07-27
              1770674 |      30091 | 2009-08-02        | 2009-08-09
              1770675 |      82328 | 2009-08-24        | 2009-09-23
              1770675 |      30091 | 2009-03-28        | 2009-04-27
              1770675 |     110862 | 2010-04-05        | 2010-05-05
              1770675 |      69985 | 2009-05-05        | 2009-06-04
              1770675 |      42383 | 2009-11-06        | 2009-12-06
              1770676 |      95538 | 2010-01-20        | 2010-02-19
              1770676 |       1780 | 2009-03-02        | 2009-04-01
              1770676 |     107680 | 2009-07-06        | 2009-08-05
              1770676 |      37455 | 2009-08-17        | 2009-09-16
              1770676 |      72120 | 2009-02-25        | 2009-03-27
(16 rows)

The vocabulary used to make these 3 cohorts work is rather small, all things considered. I’m sort of cheating: since OHDSI’s CONCEPT_ANCESTOR table flattens the concept hierarchy, I need only include 3 kinds of records: a) the exact concepts used in the cohort definitions, b) the exact concepts used in the patient data that trigger the cohorts, and c) direct links associating them via the ancestor table. In particular, the concept hierarchy itself can be bypassed. This isn’t ideal; I’ve asked IHTSDO whether I could include hierarchical elements as well.
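A minimal sketch of that trick, using Python's `sqlite3` and illustrative concept IDs: with a flattened CONCEPT_ANCESTOR, one direct ancestor-descendant row is enough for a typical cohort-style lookup to succeed, no intermediate hierarchy required.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE concept_ancestor (ancestor_concept_id INT, descendant_concept_id INT);
CREATE TABLE condition_occurrence (person_id INT, condition_concept_id INT);

-- One flattened row links the concept referenced by the cohort
-- definition (316866, illustrative) directly to the concept appearing
-- in the patient data (320128, illustrative), bypassing the hierarchy.
INSERT INTO concept_ancestor VALUES (316866, 320128);
INSERT INTO condition_occurrence VALUES (1, 320128);
""")

# A typical cohort-style descendant lookup still succeeds.
hits = con.execute("""
SELECT person_id FROM condition_occurrence
WHERE condition_concept_id IN (
    SELECT descendant_concept_id FROM concept_ancestor
    WHERE ancestor_concept_id = 316866
)
""").fetchall()
```

Queries that walk the hierarchy in other ways (e.g. via CONCEPT_RELATIONSHIP) would of course still fail, which is exactly the limitation described above.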

synpuf-10p=# select vocabulary_id, count(*) from concept group by vocabulary_id order by count(*) desc;
    vocabulary_id     | count 
 Concept Class        |    24
 ICD9CM               |    22
 RxNorm               |    20
 Vocabulary           |    19
 NDC                  |    13
 SNOMED               |    12
 Domain               |     9
 Visit                |     4
 CMS Place of Service |     3
 Race                 |     2
 Condition Type       |     2
 Relationship         |     2
 Gender               |     2
 None                 |     1
 Ethnicity            |     1
 Visit Type           |     1
 Drug Type            |     1
 Obs Period Type      |     1
(18 rows)

Anyway, I’m sure there are much better ways to go about this. However, as I work on translating SQL queries into DataKnots, I need a regression test database. In particular, I will need to extend the database to include patients that don’t trigger the cohort definitions, ideally one patient for each kind of case, to ensure that the cohort definitions clearly handle boundary conditions. Over time, if this is successful, the database may grow as needed to accommodate a translation of these and the common OHDSI queries.

I’d like to ask for a bit of help with regard to cohort 1770673|Angioedema events so this fair use test database could be used in conjunction with The Book Of OHDSI. While SynPUF 5% has hits for the other 3 cohorts used in the book, there’s no data that matches this nor some expected relationship with this and other cohorts. What would a minimal person look like if they triggered this cohort, yet overlapped with the other cohorts in the book: 1770675|ACE inhibitors, 1770676|Thiazide-like diuretics, and 1770674|Acute myocardial infarction? For a hand-crafted database for testing/demonstration, I see no problem in adding completely fake individuals that show what we wish and help us validate our cohort queries.


I am not sure I am following. Why do you want to create a mini-dataset? Synpuf and the other synthetic databases (Synthea, Mimic) have exactly that purpose.

Christian, I think Martijn outlined the primary goals very well at the top of this thread. There is a need for a very small OHDSI testing data set that could be used to demonstrate techniques and validate implementations. I would add one additional criterion important to me: I need one for regression testing – I’m not a fan of writing code unless I’m writing coverage tests at the same time. The current data-sets don’t meet these needs.

  • For automated regression tests, the data-sets must be small enough to easily copy and spin up. Furthermore, they can’t have redistribution constraints that require click-through, and they need to be publicly downloadable. When I commit, the public must be able to see the pass/fail, or it’s not an open project that can attract collaborators.

  • The examples in the Book of OHDSI should be testable using this minimal dataset, so that someone could easily download the database and verify they understand the core of the system. This means that the dataset should provide test data for cases that are covered in the book, such as the Angioedema cohort (which doesn’t have hits in SynPUF).

Care must be taken here: there’s an element of hand-crafting test cases so that they illustrate what you need to show but don’t overwhelm the audience. This also means that the vocabularies must be very tiny, so they fall clearly within the “fair use” doctrine. The larger your database, the closer you get to violating the fair-use safe harbor.
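As a sketch of the kind of regression check this enables (the in-memory table is a stand-in for the shared SQLite file, and the expected counts are invented for illustration, not taken from the real SynPUF-10p data):

```python
import sqlite3

# Build a throwaway stand-in for the test database; a real regression
# test would open the shared SQLite file instead of creating tables.
con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE cohort (cohort_definition_id INT, subject_id INT,
                     cohort_start_date TEXT, cohort_end_date TEXT)
""")
con.executemany("INSERT INTO cohort VALUES (?, ?, ?, ?)", [
    (1770674, 95538, "2009-03-30", "2009-04-06"),
    (1770675, 82328, "2009-08-24", "2009-09-23"),
    (1770676, 95538, "2010-01-20", "2010-02-19"),
])

# Pin the expected cohort sizes; any change to the data or to the
# cohort-generating SQL that alters these counts fails the build.
EXPECTED = {1770674: 1, 1770675: 1, 1770676: 1}

counts = dict(con.execute(
    "SELECT cohort_definition_id, COUNT(*) FROM cohort GROUP BY cohort_definition_id"
))
assert counts == EXPECTED, f"cohort counts drifted: {counts}"
```

Because the dataset is hand-crafted, the expected values can be asserted exactly rather than statistically, which is what makes it usable in CI.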

Generally, I really like the idea of using Synthea to make these datasets. However, I’m on a very tight deadline, and no test database was available to me this week. I don’t even like how I made the database… it was brute force. But, I can refactor that later. Where I’m currently stuck, since I lack any clinical expertise, is that I don’t know what sort of minimal patient data would be needed for the Angioedema cohort and how it fits into the narrative outlined in the Book of OHDSI.

P.S. The only place where I’d differ from Martijn is that I need this database to be usable without requiring “R” / Java tooling. Hence, if what others produce can only use it by incorporating a large set of dependencies, it will not meet my regression test requirement. I’d also prefer SQLite over PostgreSQL, but for the time being, my project has a hard requirement on PostgreSQL. Even so, Martijn is quite correct, there should be a SQLite version of the test database.
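For what it’s worth, a sketch of what R-free access could look like, using only Python’s standard library (“OhdsiDataset.sqlite” is a placeholder filename, not an agreed artifact):

```python
import sqlite3

def count_persons(db_path):
    """Open the packaged SQLite file and run a plain SQL query.

    Nothing beyond Python's standard library is needed: no R, no Java,
    no database server. db_path would be e.g. "OhdsiDataset.sqlite",
    a placeholder name for whatever file the dataset package ships.
    """
    con = sqlite3.connect(db_path)
    try:
        return con.execute("SELECT COUNT(*) FROM person").fetchone()[0]
    finally:
        con.close()
```

Any language with a SQLite driver could do the same against the identical file, which is the appeal of SQLite as the distribution format.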