(Credit for this idea goes to @Patrick_Ryan)
In R we have the iris and cars datasets, which are used in many R examples. It would be very helpful to have something similar for OHDSI that could be used in the examples of our various R packages, and in The Book of OHDSI. The dataset should adhere to the CDM and be small, but still be able to demonstrate various complex functions, such as running CohortMethod and PatientLevelPrediction.
The steps to create such a dataset could be:
1. Start with an existing simulated dataset, such as SynPUF or Synthea. Alternatively, we could take a real dataset and perturb it beyond the ability to reconstruct the original data.
2. Filter to a subpopulation to reduce size while maintaining the ability to run the various use cases. Since CohortMethod is probably the most restrictive, we could restrict the population to those needed for the CohortMethod vignette (new users of diclofenac or celecoxib).
3. Filter to a subset of concepts to reduce size even further. This dataset will need a copy of the vocab tables, which are relatively large if unfiltered. If we filter everything to, say, the 100 most prevalent concepts in each domain in the dataset after step 2, we will have reduced the size of the embedded vocabulary by a huge fraction, and the size of the dataset by a fair chunk.
4. Embed this CDM database (including the filtered vocab) in an R package.
Ideally, the end result would just be a few megabytes.
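To make step 3 concrete, here is a minimal sketch in base R. It assumes prevalence means record count (person count would be an equally reasonable choice), and it uses tiny made-up data frames in place of real CDM tables. The helper name `keepTopConcepts` is hypothetical and not part of any OHDSI package.

```r
# Toy stand-ins for CDM tables; real tables have many more columns and rows.
conditionOccurrence <- data.frame(
  person_id = c(1, 1, 2, 2, 3, 3, 4, 5),
  condition_concept_id = c(101, 102, 101, 103, 101, 102, 104, 101)
)
concept <- data.frame(
  concept_id = c(101, 102, 103, 104, 105),
  concept_name = c("A", "B", "C", "D", "E")
)

# Hypothetical helper: keep only the rows whose concept is among the
# n concepts with the highest record count in this table.
keepTopConcepts <- function(data, conceptColumn, n) {
  counts <- sort(table(data[[conceptColumn]]), decreasing = TRUE)
  topConcepts <- names(head(counts, n))
  data[as.character(data[[conceptColumn]]) %in% topConcepts, , drop = FALSE]
}

# The proposal suggests n = 100 per domain; n = 2 keeps the toy example readable.
filteredConditions <- keepTopConcepts(conditionOccurrence, "condition_concept_id", n = 2)

# Shrink the vocabulary to the concepts that are still referenced.
filteredConcept <- concept[concept$concept_id %in% filteredConditions$condition_concept_id, ]
```

In a real run the same filter would be applied to each domain table, and the other vocab tables (relationships, ancestors, and so on) would need to be restricted to the surviving concept IDs as well.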
For step 4, one idea would be to use the RSQLite package, and store the dataset inside our dataset package in a SQL-queryable database. In theory, this code could work without installing any other software, and without having access to a database server:
install.packages("OhdsiDataset")
library(OhdsiDataset)
connection <- connect(dbms = "sqlite", server = "OhdsiDataset")
querySql(connection, "SELECT COUNT(*) FROM person;")
#12345
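As a rough illustration of the storage side of step 4, the sketch below uses only DBI and RSQLite calls to write a table to a SQLite file and query it back. The three-row `person` table is made up, and the temporary file stands in for a database file that a package would presumably ship in its `inst/` folder and locate with `system.file()`; that layout is my assumption, not a settled design.

```r
library(DBI)

# A temporary file stands in for the database file embedded in the package.
dbPath <- tempfile(fileext = ".sqlite")
connection <- dbConnect(RSQLite::SQLite(), dbPath)

# Made-up miniature person table (only a few of the CDM columns).
person <- data.frame(
  person_id = 1:3,
  year_of_birth = c(1950L, 1962L, 1975L),
  gender_concept_id = c(8507L, 8532L, 8532L)
)
dbWriteTable(connection, "person", person)

# Plain SQL works without any database server being installed.
result <- dbGetQuery(connection, "SELECT COUNT(*) AS person_count FROM person;")
dbDisconnect(connection)
```

Because everything lives in a single file, the size of that file is a direct measure of how close we are to the "few megabytes" target.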
Let me know what you think of this idea. What would be a good starting point (step 1)?
Also, if we agree this is a good idea, I could really use some help (meaning someone willing to write and test steps 1-3 in R, so I can focus on 4).