OHDSI Home | Forums | Wiki | Github

Synthetic and de-identified data

Group: Synthetic data has helped us to further the mission of OHDSI. Most of our work is related to CMS SynPUF, but are there others? Could we use this thread to create such a list?

Fake or synthetic data

  1. CMS SynPUF
  2. Data Entrepreneur Clinical Observables Yardstick
  3. PseudoVet
  4. Original OMOP OSIM 1 and OSIM 2
  5. Harvest from CHOP
  6. OpenMRS has limited data
  7. iPDG

There are de-identified but individual level data such as:

  1. MIMIC

Anything else?


There is also a paper about producing synthetic patients and synthetic EHRs based on HL7 FHIR.

1 Like

@SCYou We just started using SyntheticMass as well! Great addition to this list.

1 Like

@krfeeney Cool! It would be a great topic for OHDSI infrastructure!

@kausarm will be presenting on our release of a CDM v5 compatible version of OSIM. Join the community call April 17th for details.

@jon_duke @kausarm Great! Thank you!!

I spoke with Jason Walonoski from the Synthea team two weeks ago and we’ve agreed to collaborate on ways to integrate Synthea into the OHDSI architecture. I have been able to run Synthea to produce CSV output and will be looking to create a converter to CDM v5.
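To make the CSV-to-CDM step concrete, here is a minimal sketch of mapping one row of a Synthea patients CSV export to a CDM v5 PERSON record. The column names (`Id`, `BIRTHDATE`, `GENDER`) follow a typical Synthea export, and 8507/8532 are the standard OMOP gender concept IDs; the helper `synthea_patient_to_person` is hypothetical and is not the converter being built here.

```python
import csv
from datetime import date
from io import StringIO

# Standard OMOP gender concept IDs (8507 = MALE, 8532 = FEMALE).
GENDER_CONCEPTS = {"M": 8507, "F": 8532}

def synthea_patient_to_person(row, person_id):
    """Map one Synthea patients.csv row (as a dict) to a minimal
    CDM v5 PERSON record. Column names assumed from a typical export."""
    birth = date.fromisoformat(row["BIRTHDATE"])
    return {
        "person_id": person_id,
        "person_source_value": row["Id"],
        "gender_concept_id": GENDER_CONCEPTS.get(row["GENDER"], 0),
        "year_of_birth": birth.year,
        "month_of_birth": birth.month,
        "day_of_birth": birth.day,
    }

# Tiny inline sample standing in for a real Synthea patients.csv file.
sample = "Id,BIRTHDATE,GENDER\nabc-123,1974-05-01,F\n"
persons = [synthea_patient_to_person(r, i + 1)
           for i, r in enumerate(csv.DictReader(StringIO(sample)))]
```

A real converter would of course also populate location, race, and ethnicity concepts and handle the other Synthea CSV files (encounters, conditions, observations), but the per-table shape is the same: read source columns, assign surrogate keys, and map source values to standard concept IDs.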

1 Like

Love Synthea (thank you, devs!), but we found it pretty challenging to define new populations when we tried to port SyntheticMass to another domain. @Frank Did you create new rules or mostly work with it out of the box?

I was working with it out of the box but Jason did spend some time showing me their visual designer for new population modules. I hope to get more involved with that aspect after we finish the converter.

I would really like to add more realistic synthetic data to Eunomia. Does anyone have synthetic datasets they would be willing to share with the community?

It’s not easy to develop analytic code for OMOP CDM without realistic data.

I agree it would be great to have more realistic synthetic data for development purposes.

At the global symposium, @ChaoPang presented his work on training a GPT on real data and using it to generate simulated data that has many of the same attributes as the original data. The technical capability is there, but there are some open questions that remain unresolved around privacy and data ownership. In a way, it is the same discussion as the one we have for sharing ACHILLES results: given the complexity of the data being shared, it is hard to oversee all that one could do with the data, so most institutions choose not to share.

1 Like

I have a large synthetic dataset here:

100K patients. I would love to collaborate on this effort, especially on using Perseus to do the mapping and ETL.


Regarding GPT creation of a large, plausible population of patients: if anyone has been able to achieve that in bulk (n > 10,000), I’d like to learn how they achieved it. Current LLMs are expensive to run at that scale (you need a robust GPU at the very least).

1 Like

Is what he presented part of Project Apollo?

Here is the link: https://www.ohdsi.org/2023showcase-503/ (I’m not involved in the project)

I believe a GPT2-like architecture was used (much smaller than the LLMs of today), so inference should be “cheaper”.

1 Like

Thanks @Sanjay_Udoshi and @schuemie! I would definitely like to collaborate on this and do anything I can to make public, semi-realistic synthetic observational data a reality. My use case is to test studies before sending them to data partners for execution on all database platforms. Unfortunately, many real studies do not run on SynPUF or Synthea data because the concept distributions are so different. I’ll check out dephi100k and, if you’re OK with it, make a pull request to add it to Eunomia (the HADES example-data R package).

1 Like

This is really exciting work by a team at Columbia. The question is how to guarantee that the data this model generates does not leak any real information about a person. Suppose the model generates a fake patient record that exactly matches a real patient record; that is bound to happen for short sequences of events. There is a concept called differential privacy that could be used here.

Roughly, an algorithm is differentially private if an observer seeing its output cannot tell whether a particular individual’s information was used in the computation. Differential privacy is often discussed in the context of identifying individuals whose information may be in a database. Although it does not directly refer to identification and reidentification attacks, differentially private algorithms provably resist such attacks. (Differential privacy, Wikipedia)

So I think if you can show that the model’s output distribution is (almost) unchanged regardless of whether any single individual’s data was included in or excluded from the training process, then the algorithm is differentially private.

I have no idea how to actually implement this for LLM training but if it is possible we could have provable privacy guarantees.

For those interested, check out the work of Thomas Strohmer at the University of California, Davis, who has been working on the theoretical foundations for provable privacy guarantees. Thomas Strohmer - Publications

1 Like

I would appreciate it if people pointed out the flaw in my thinking here, but I think privacy is not an issue with this type of approach.

Yes, there is a possibility that, given some information about a patient, the model can exactly generate the remaining data of that patient. However, there is no way of knowing whether the generated data is in fact exactly the same as the input data, or a random sequence that happens to also fit the model.

1 Like

(Similar to how a random number generator is not a security risk, even though it could generate my social security number.)

Agreed. :slight_smile:

Can we set up a brief call to discuss this (Zoom/Teams)? You can PM me for my email, etc.

1 Like