Synthetic dataset

Recently, I have read several papers on generating synthetic data in health care and found that one of the most widely used methods is the Bayesian Network (BN). To generate a dataset of Type 2 Diabetes patients that would fully represent the real population, do you have any suggestions other than the BN technique?

We developed a Bayesian Network version and it can’t represent everything about a population; instead, it can generate a specific subset of interesting data and maintain some relationships. So the purely synthetic versions of data are limited. Another mechanism people are trying is synthetic derivatives from real data, using tools like MDClone or the specifications from Acorn.ai. That approach still has limitations, of course.
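
To make "maintain some relationships" concrete, here is a minimal sketch of ancestral sampling from a tiny Bayesian network. It is not the network we built: the three variables, the graph structure, and every probability below are invented for illustration and are not clinical estimates.

```python
import random

# Hypothetical network: age_group -> obese, and (age_group, obese) -> t2dm.
# All probabilities are made up for illustration only.
P_AGE = {"under_50": 0.6, "50_plus": 0.4}
P_OBESE_GIVEN_AGE = {"under_50": 0.25, "50_plus": 0.35}
P_T2DM_GIVEN_AGE_OBESE = {
    ("under_50", False): 0.03,
    ("under_50", True): 0.10,
    ("50_plus", False): 0.10,
    ("50_plus", True): 0.30,
}


def sample_patient(rng):
    """Ancestral sampling: draw each node only after its parents are drawn."""
    age = rng.choices(list(P_AGE), weights=list(P_AGE.values()))[0]
    obese = rng.random() < P_OBESE_GIVEN_AGE[age]
    t2dm = rng.random() < P_T2DM_GIVEN_AGE_OBESE[(age, obese)]
    return {"age_group": age, "obese": obese, "t2dm": t2dm}


def generate(n, seed=0):
    """Return n synthetic patient records; a fixed seed makes runs repeatable."""
    rng = random.Random(seed)
    return [sample_patient(rng) for _ in range(n)]


def t2dm_rate(rows):
    """Fraction of records in rows that have t2dm set."""
    return sum(p["t2dm"] for p in rows) / len(rows)


if __name__ == "__main__":
    patients = generate(10_000)
    obese = [p for p in patients if p["obese"]]
    lean = [p for p in patients if not p["obese"]]
    print(f"T2DM rate, obese: {t2dm_rate(obese):.3f}")
    print(f"T2DM rate, not obese: {t2dm_rate(lean):.3f}")
```

The synthetic records reproduce the obesity/T2DM dependence that was encoded in the tables, but nothing outside those three variables, which is the limitation described above.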

Hi @shohreh,

Have you looked into this?

There are even scripts to transform it to the CDM.

To the best of my understanding, Synthea is useful for generating synthetic data based on its own built-in knowledge. What I’m looking for is a way to generate synthetic data based on my own data. How can I do that?

Hey everyone,
I am not sure if anyone is still interested in this topic. I wanted to share a tutorial on synthetic data generation for OMOP using our recently released foundation model architecture, CEHR-XGPT: A Scalable Multi-Task Foundation Model for Structured Electronic Health Records (EHRs).

You can try it out through a Google Colab notebook in the repo cehrgpt_tutorials. If you have an OMOP instance, you’ll be able to easily generate synthetic data yourself. I hope this tutorial is useful to the OHDSI community, especially for those interested in exploring synthetic data as a resource for research and method development.

Preprint is available here: [2509.03643] CEHR-XGPT: A Scalable Multi-Task Foundation Model for Electronic Health Records
Source code is available here: GitHub - knatarajan-lab/cehrgpt: CEHR-GPT: Generating Electronic Health Records with Chronological Patient Timelines