Synthetic data with simulated covid outbreak

Michael_Shamberger · March 30, 2020, 6:33am

Here is a synthetic OMOP CSV dataset that contains a simulated COVID-19 pandemic. I started with the Synthea covid19 branch and generated a dataset for Massachusetts, USA, then converted it to OMOP using the ETL-Synthea project. The pandemic starts on January 1, 2020 and affects all 10K patients, with infections following a standard distribution over a period of 3 months.
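
To make the shape of the outbreak concrete, here is a minimal sketch of drawing one infection date per patient across the 3-month window. It assumes "standard distribution" means a normal (bell-shaped) curve peaking mid-window and approximates 3 months as 90 days; the function name and parameters are illustrative and are not taken from Synthea, whose own modules decide the actual timing.

```python
import random
from datetime import date, timedelta

START = date(2020, 1, 1)   # outbreak start stated in the post
WINDOW_DAYS = 90           # "3 months", approximated as 90 days (assumption)

def sample_onset_dates(n_patients, seed=0):
    """Draw one infection date per patient from a truncated normal curve."""
    rng = random.Random(seed)
    mean = WINDOW_DAYS / 2   # peak in the middle of the window (assumption)
    sd = WINDOW_DAYS / 6     # nearly all draws fall inside the window
    dates = []
    while len(dates) < n_patients:
        offset = rng.gauss(mean, sd)
        if 0 <= offset < WINDOW_DAYS:   # reject the rare draw outside the window
            dates.append(START + timedelta(days=int(offset)))
    return dates

onsets = sample_onset_dates(10_000)
```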

Let me know if you have any particular need for synthetic data. I am building Azure Pipelines to generate it on a schedule.

https://dev.azure.com/shambergerm/Covid19/_git/covid19Storage?path=%2Fomop%2FMassachusetts_covid19_omop_531.zip

You should download the vocabulary separately from Athena. The vocabulary date for this dataset was March 29, 2020, and it includes the latest COVID codes.

synthea covid19 branch and modules:

Here is an example visual simulation of a similar dataset generated for Finland using CDM 6.0, which has geographic capability because the location table includes latitude and longitude:

Andy_Kanter · March 30, 2020, 1:56pm

Does this include more specific COVID-19 diagnoses and tests coded with appropriate SNOMED, ICD-10 and LOINC codes? What about SARS-CoV-2 positive patients with other manifestations (pneumonia, ARDS, etc.) that require multiple ICD-10 and/or SNOMED codes per diagnosis? Thanks!

Michael_Shamberger · March 30, 2020, 3:37pm

It contains specific COVID-19 diagnoses and tests coded with SNOMED and LOINC. Synthea does not use ICD-10 because it requires payment.

The survivor and non-survivor lab values are based on Figure 2 from https://doi.org/10.1016/S0140-6736(20)30566-3
“Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study”

Here are some examples:
SNOMED:
49727002, Cough (finding)
386661006, Fever (finding)
267036007, Dyspnea (finding)
233604007, Pneumonia (disorder)
840544004, Suspected COVID-19
840539006, COVID-19

LOINC:
89577-1
89579-7
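
For anyone who wants to find these codes in the OMOP files, here is a minimal sketch of the lookup. It assumes the Athena CONCEPT file is tab-delimited with concept_id, concept_code and vocabulary_id columns, and that the export has a comma-delimited condition_occurrence file with person_id and condition_concept_id. The toy rows and their concept_id values are made up for illustration; real IDs come from the Athena download.

```python
import csv
import io

COVID_SNOMED_CODES = {"840539006", "840544004"}  # SNOMED codes listed above

def concept_ids_for(concept_rows, codes, vocabulary_id="SNOMED"):
    """Map source codes to OMOP concept_ids using CONCEPT table rows."""
    return {
        row["concept_id"]
        for row in concept_rows
        if row["vocabulary_id"] == vocabulary_id and row["concept_code"] in codes
    }

def persons_with_condition(condition_rows, concept_ids):
    """Return person_ids having any of the given condition concepts."""
    return {
        row["person_id"]
        for row in condition_rows
        if row["condition_concept_id"] in concept_ids
    }

# Toy data: concept_id values 1 and 2 are placeholders, not real Athena IDs.
concept_csv = (
    "concept_id\tconcept_code\tvocabulary_id\n"
    "1\t840539006\tSNOMED\n"
    "2\t49727002\tSNOMED\n"
)
condition_csv = "person_id,condition_concept_id\n10,1\n11,2\n12,1\n"

concepts = list(csv.DictReader(io.StringIO(concept_csv), delimiter="\t"))
conditions = list(csv.DictReader(io.StringIO(condition_csv)))

covid_ids = concept_ids_for(concepts, COVID_SNOMED_CODES)
covid_persons = persons_with_condition(conditions, covid_ids)
```

The same pattern should work for the LOINC test codes by passing vocabulary_id="LOINC" and filtering the measurement table instead.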

Synthea has these modules for covid19 simulation:

Risk determination

synthetichealth/synthea/blob/covid19/src/main/resources/modules/covid19/determine_risk.json

{
  "name": "determine_risk",
  "states": {
    "Initial": {
      "type": "Initial",
      "direct_transition": "Determine Risk",
      "remarks": [
        "Assess patient comorbidities and set attribute (covid19_risk):",
        "  - (high) if patient has comorbidity impacting risks",
        "  - (low) if patient has no comorbidity impacting risks"
      ]
    },
    "Terminal": {
      "type": "Terminal"
    },
    "Determine Risk": {
      "type": "Simple",
      "conditional_transition": [
        {
          "transition": "High Risk of Severe Disease",

This file has been truncated.

Infection sequence

synthetichealth/synthea/blob/covid19/src/main/resources/modules/covid19/infection.json

{
  "name": "infection",
  "states": {
    "Initial": {
      "type": "Initial",
      "direct_transition": "Determine Risk"
    },
    "Terminal": {
      "type": "Terminal"
    },
    "Encounter for Test": {
      "type": "Encounter",
      "encounter_class": "ambulatory",
      "reason": "",
      "codes": [
        {
          "system": "SNOMED-CT",
          "code": 185345009,
          "display": "Encounter for symptom (procedure)"
        }

This file has been truncated.

Non survivor lab values

synthetichealth/synthea/blob/covid19/src/main/resources/modules/covid19/nonsurvivor_lab_values.json

{
  "name": "nonsurvivor_lab_values",
  "remarks": [
    "Based on Figure 2 from https://doi.org/10.1016/S0140-6736(20)30566-3"
  ],
  "states": {
    "Initial": {
      "type": "Initial",
      "direct_transition": "Day"
    },
    "Terminal": {
      "type": "Terminal"
    },
    "Day": {
      "type": "Simple",
      "conditional_transition": [
        {
          "transition": "DDimer_4",
          "condition": {
            "condition_type": "Attribute",

This file has been truncated.

Survivor lab values

synthetichealth/synthea/blob/covid19/src/main/resources/modules/covid19/survivor_lab_values.json

{
  "name": "survivor_lab_values",
  "remarks": [
    "Based on Figure 2 from https://doi.org/10.1016/S0140-6736(20)30566-3"
  ],
  "states": {
    "Initial": {
      "type": "Initial",
      "direct_transition": "Day"
    },
    "Terminal": {
      "type": "Terminal"
    },
    "Day": {
      "type": "Simple",
      "conditional_transition": [
        {
          "transition": "DDimer_4",
          "condition": {
            "condition_type": "Attribute",

This file has been truncated.
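
Since the excerpts above are cut off, a quick way to get an overview of a full module is to list each state and the states it can move to. The sketch below assumes the usual Synthea transition keys (direct_transition, conditional_transition, distributed_transition, complex_transition). The embedded toy_module is a made-up stand-in that only reuses state names visible in the excerpts; it is not the real content of determine_risk.json.

```python
import json

# Toy module for illustration only; the real files are truncated above.
MODULE_JSON = """
{
  "name": "toy_module",
  "states": {
    "Initial": {"type": "Initial", "direct_transition": "Determine Risk"},
    "Determine Risk": {
      "type": "Simple",
      "conditional_transition": [
        {"condition": {"condition_type": "Attribute",
                       "attribute": "covid19_risk",
                       "operator": "==",
                       "value": "high"},
         "transition": "High Risk of Severe Disease"},
        {"transition": "Terminal"}
      ]
    },
    "High Risk of Severe Disease": {"type": "Simple", "direct_transition": "Terminal"},
    "Terminal": {"type": "Terminal"}
  }
}
"""

LIST_TRANSITIONS = ("conditional_transition", "distributed_transition", "complex_transition")

def next_states(state):
    """Collect the names of all states reachable in one step."""
    targets = []
    if "direct_transition" in state:
        targets.append(state["direct_transition"])
    for key in LIST_TRANSITIONS:
        for option in state.get(key, []):
            if "transition" in option:
                targets.append(option["transition"])
            # complex transitions may nest a list of weighted targets
            for dist in option.get("distributions", []):
                targets.append(dist["transition"])
    return targets

module = json.loads(MODULE_JSON)
graph = {name: next_states(state) for name, state in module["states"].items()}

for name, targets in graph.items():
    print(f"{name} -> {targets}")
```

To inspect a real module, replace MODULE_JSON with the contents of the file downloaded from the covid19 branch.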

tom.white.md · November 19, 2020, 10:06pm

Michael, are you still creating COVID synthetic data? We’re standing up our OHDSI infrastructure on Azure and want to start testing the environment and sizing our needs based on a population of about 2 million patients.

Michael_Shamberger · December 3, 2020, 4:28pm

Missed this one.

I can describe the process for making the data:

1. Generate data with Synthea. At the time there was a covid branch, and they have made a lot of improvements since then. I'm not sure whether they merged that branch.
2. Convert the data with ETL-Synthea.

I couldn’t get ETL-Synthea to scale beyond around 50k patients. The scripts would fail in Postgres even on my laptop with 32 GB of RAM and an SSD, because they try to process the data all at once rather than in batches.

I was working on a translator from Synthea to OMOP based on Python pandas, to skip the Postgres database and process the data in batches: science-automation/ETL-Synthea-Python (ETL from Synthea to OMOP format using Python pandas), on github.com.

It was meant for just this purpose of generating massive amounts of data.

It is about 98% done, but I still needed to do some testing to confirm that its output matches ETL-Synthea's.
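
The batching idea described above can be sketched roughly as follows. This is not the code from ETL-Synthea-Python; it is a standard-library illustration that assumes Synthea's patients.csv has Id, BIRTHDATE (YYYY-MM-DD) and GENDER (M/F) columns, and it fills only a few OMOP person columns. The helper names and the batch size are made up for the example.

```python
import csv
import io
from itertools import islice

GENDER_CONCEPT = {"M": 8507, "F": 8532}  # OMOP standard gender concept_ids

def batches(rows, size):
    """Yield lists of at most `size` rows so memory use stays bounded."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

def to_person(batch, first_person_id):
    """Convert one batch of Synthea patient rows to OMOP person rows."""
    out = []
    for offset, row in enumerate(batch):
        year, month, day = row["BIRTHDATE"].split("-")
        out.append({
            "person_id": first_person_id + offset,
            "gender_concept_id": GENDER_CONCEPT.get(row["GENDER"], 0),
            "year_of_birth": int(year),
            "month_of_birth": int(month),
            "day_of_birth": int(day),
            "person_source_value": row["Id"],
        })
    return out

def convert(patients_file, person_file, batch_size=10_000):
    """Stream patients to person rows batch by batch; return rows written."""
    reader = csv.DictReader(patients_file)
    writer = None
    next_id = 1
    for batch in batches(reader, batch_size):
        persons = to_person(batch, next_id)
        if writer is None:
            writer = csv.DictWriter(person_file, fieldnames=list(persons[0]))
            writer.writeheader()
        writer.writerows(persons)
        next_id += len(persons)
    return next_id - 1

# Tiny in-memory demo; real runs would pass open file handles instead.
src = io.StringIO(
    "Id,BIRTHDATE,GENDER\n"
    "a1,1980-05-17,F\n"
    "b2,1975-11-02,M\n"
    "c3,2001-01-30,F\n"
)
dst = io.StringIO()
total = convert(src, dst, batch_size=2)
```

With pandas, the same streaming pattern is available through the chunksize argument of read_csv, which returns an iterator of DataFrames instead of loading the whole file.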

Ben_Simon · November 21, 2022, 6:13pm

Hi! We’re looking for some sample OMOP data for our own work. Do you still have access to the 10k or 100k patient sample datasets mentioned here and in the repo? I couldn’t find them when I looked.