Synthetic data with simulated covid outbreak

Here is a synthetic OMOP CSV dataset that contains a simulated COVID-19 pandemic. I started with the Synthea COVID branch and generated a dataset for MA, USA, which was then converted to OMOP using the ETL-Synthea project. The pandemic starts on Jan 1, 2020 and affects all 10K patients on a standard distribution over a period of 3 months.
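
If you want to sanity-check the outbreak curve after downloading, something like the following could count onsets per week. This is only a minimal sketch, not part of the dataset tooling. It assumes the standard OMOP condition_occurrence columns `condition_start_date` and `condition_source_value`, and that the ETL keeps the SNOMED code in `condition_source_value`. SNOMED 840544004 (Suspected COVID-19) is used purely as an example code, and the inline sample rows are made up.

```python
import csv
import io
from collections import Counter
from datetime import date


def weekly_onsets(csv_file, source_code, start=date(2020, 1, 1)):
    """Count condition onsets per week; week 0 begins at `start`."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        # Assumption: the SNOMED code is stored in condition_source_value.
        if row["condition_source_value"] != source_code:
            continue
        onset = date.fromisoformat(row["condition_start_date"])
        counts[(onset - start).days // 7] += 1
    return dict(sorted(counts.items()))


# Made-up rows standing in for condition_occurrence.csv.
SAMPLE_CSV = (
    "person_id,condition_start_date,condition_source_value\n"
    "1,2020-01-03,840544004\n"
    "2,2020-01-10,840544004\n"
    "3,2020-01-12,840544004\n"
    "4,2020-01-12,49727002\n"
)

print(weekly_onsets(io.StringIO(SAMPLE_CSV), "840544004"))
```

For the real file, pass `open("condition_occurrence.csv", newline="")` instead of the in-memory sample (the file name here is an assumption about how the export is laid out).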

Let me know if you have any particular need for synthetic data. I am building Azure Pipelines to generate it on a schedule.


You should download the vocabulary separately from Athena. The vocabulary date for this dataset was 29.3.2020, and it includes the latest COVID codes.

Synthea covid19 branch and modules:

Here is an example visual simulation of a similar dataset generated for Finland using CDM 6.0, which has geo capability because the location table carries lat/lon:

Does this include more specific COVID-19 diagnoses and tests coded with appropriate SNOMED, ICD-10 and LOINC codes? What about SARS-CoV-2 positive patients with other manifestations (pneumonia, ARDS, etc.) that require multiple ICD-10 and/or SNOMED codes per diagnosis? Thanks!

It contains specific COVID-19 diagnoses and tests coded with SNOMED and LOINC. Synthea does not use ICD-10 as it requires payment.

Survivor and non-survivor lab values are based on Figure 2 from https://doi.org/10.1016/S0140-6736(20)30566-3
“Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study”

Here are some examples:
49727002, Cough (finding)
386661006, Fever (finding)
267036007, Dyspnea (finding)
233604007, Pneumonia (disorder)
840544004, Suspected COVID-19
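
To find these codes in the OMOP tables you would normally translate them to concept_ids using the Athena vocabulary. Below is a rough sketch of that lookup, under some assumptions: that Athena's CONCEPT.csv is tab-delimited and unquoted, and that it has the columns `concept_id`, `vocabulary_id` and `concept_code`. The concept_ids in the sample rows are placeholders, not real OMOP ids.

```python
import csv
import io

# SNOMED codes taken from the list above.
SNOMED_CODES = {"49727002", "386661006", "267036007", "233604007", "840544004"}


def snomed_to_concept_id(concept_file, codes):
    """Map SNOMED concept_code -> concept_id from a CONCEPT.csv-style file."""
    # Assumption: the file is tab-delimited and does not use quoting.
    reader = csv.DictReader(concept_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {
        row["concept_code"]: int(row["concept_id"])
        for row in reader
        if row["vocabulary_id"] == "SNOMED" and row["concept_code"] in codes
    }


# Placeholder rows; the concept_id values are invented for illustration.
SAMPLE_CONCEPTS = (
    "concept_id\tconcept_name\tvocabulary_id\tconcept_code\n"
    "1001\tCough\tSNOMED\t49727002\n"
    "1002\tFever\tSNOMED\t386661006\n"
    "1003\tUnrelated entry\tLOINC\t49727002\n"
)

print(snomed_to_concept_id(io.StringIO(SAMPLE_CONCEPTS), SNOMED_CODES))
```

Filtering on `vocabulary_id` matters because the same code string can appear in more than one vocabulary.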


Synthea has these modules for covid19 simulation:

Risk determination

Infection sequence

Non survivor lab values

Survivor lab values

Michael, are you still creating COVID synthetic data? We’re standing up our OHDSI infrastructure on Azure, and want to start testing the environment, and sizing our needs, based upon a population size of about 2 million patients.