Group: Synthetic data has helped us further the mission of OHDSI. Most of our work is related to CMS SynPUF, but are there others? Could we use this thread to create such a list?
I spoke with Jason Walonoski from the Synthea team two weeks ago, and we’ve agreed to collaborate on ways to integrate Synthea into the OHDSI architecture. I have been able to use Synthea to generate simulated patients as CSV output and will be looking to create a converter to CDM v5.
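For anyone curious what that converter might involve, here is a minimal sketch of one step (condition_occurrence only), assuming Synthea's default CSV export (conditions.csv with START, STOP, PATIENT, ENCOUNTER, CODE, DESCRIPTION columns). The person_ids and snomed_to_concept lookups are hypothetical; you would build them from your person table and the OMOP vocabulary.

```python
# Minimal sketch of one step of a Synthea-CSV -> CDM v5 converter (condition_occurrence only).
# Assumes Synthea's default CSV export; person_ids and snomed_to_concept are hypothetical
# lookups built from the person table and the OMOP vocabulary.
import pandas as pd

CONDITION_TYPE_EHR = 32817  # 'EHR' type concept; verify against your vocabulary version

def build_condition_occurrence(conditions_csv, person_ids, snomed_to_concept):
    src = pd.read_csv(conditions_csv, parse_dates=["START", "STOP"])
    return pd.DataFrame({
        "condition_occurrence_id": range(1, len(src) + 1),
        "person_id": src["PATIENT"].map(person_ids),            # Synthea UUID -> integer person_id
        "condition_concept_id": src["CODE"].map(snomed_to_concept).fillna(0).astype(int),
        "condition_start_date": src["START"].dt.date,
        "condition_end_date": src["STOP"].dt.date,
        "condition_type_concept_id": CONDITION_TYPE_EHR,
        "condition_source_value": src["CODE"].astype(str),
    })
```

The real converter will obviously need all the other domains (drug_exposure, visit_occurrence, measurement, etc.) and proper source-to-standard mapping, but the per-table shape is roughly this.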
Love Synthea (thank you, devs!), but in our experience trying to port SyntheticMass to another topic, we found it pretty challenging to define new populations. @Frank Did you create new rules or mostly work with it out of the box?
I was working with it out of the box but Jason did spend some time showing me their visual designer for new population modules. I hope to get more involved with that aspect after we finish the converter.
I would really like to add more realistic synthetic data to Eunomia. Does anyone have synthetic datasets they would be willing to share with the community?
It’s not easy to develop analytic code for OMOP CDM without realistic data.
I agree it would be great to have more realistic synthetic data for development purposes.
At the global symposium, @ChaoPang presented his work on training a GPT on real data and using it to generate simulated data that has many of the same attributes as the original data. The technical capability is there, but there are still unresolved questions around privacy and data ownership. In a way, it is the same discussion as the one we have about sharing ACHILLES results: given the complexity of the data being shared, it is hard to oversee all that one could do with it, so most institutions choose not to share.
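To make the approach concrete (this is my rough understanding of the idea, not the actual code from that work): each patient's record is flattened into a token sequence, demographics first and then visit-delimited concept IDs with coarse time-gap tokens, and a GPT-style model is trained to continue such sequences. A toy encoding sketch, with the token scheme invented for illustration:

```python
# Toy illustration of turning an OMOP-style patient record into a token sequence
# for a GPT-style generator. The token vocabulary and layout are invented for this
# example and are not the actual scheme used in the presented work.
def encode_patient(person, visits):
    tokens = [f"YOB:{person['year_of_birth']}", f"GENDER:{person['gender_concept_id']}"]
    prev_date = None
    for visit in sorted(visits, key=lambda v: v["visit_start_date"]):
        if prev_date is not None:
            gap_days = (visit["visit_start_date"] - prev_date).days
            tokens.append(f"GAP:{min(gap_days, 365)}")         # coarse time-gap token
        tokens.append("VISIT_START")
        tokens.extend(str(c) for c in visit["concept_ids"])    # condition/drug/procedure concept IDs
        tokens.append("VISIT_END")
        prev_date = visit["visit_start_date"]
    return tokens
```

Generation is then just sampling new token sequences from the trained model and decoding them back into CDM rows.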
100K patients. I would love to collaborate on this effort, especially using Perseus to do the mapping and ETL.
-S
Regarding GPT-based creation of a large, plausible population of patients: if anyone has been able to achieve that in bulk (n > 10,000), I’d like to learn how they achieved it. Current LLMs are expensive to run at that scale (you need a robust GPU at the very least).
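For what it's worth, in my limited experience the cost is mostly a batching problem: sampling sequences one at a time is slow, but a single modern GPU can generate them in large batches. A rough sketch with Hugging Face transformers, where "my-patient-gpt" is a placeholder for whatever patient-sequence model you have fine-tuned and the sampling settings are illustrative only:

```python
# Rough sketch of bulk sampling from a (hypothetical) fine-tuned patient-sequence GPT.
# "my-patient-gpt" is a placeholder model name; batch size and sampling settings are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my-patient-gpt")
model = AutoModelForCausalLM.from_pretrained("my-patient-gpt").to("cuda").eval()

def sample_patients(n_patients, batch_size=256, max_tokens=512):
    sequences = []
    prompt = tokenizer("VISIT_START", return_tensors="pt").input_ids.to("cuda")
    with torch.no_grad():
        for _ in range(0, n_patients, batch_size):
            out = model.generate(
                prompt.repeat(batch_size, 1),        # one prompt per patient in the batch
                do_sample=True, top_p=0.95,
                max_new_tokens=max_tokens,
                pad_token_id=tokenizer.eos_token_id,
            )
            sequences.extend(tokenizer.batch_decode(out, skip_special_tokens=True))
    return sequences[:n_patients]
```

With batching like this, tens of thousands of synthetic patients is hours rather than weeks on one GPU, though the exact throughput obviously depends on model size and sequence length.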
Thanks @Sanjay_Udoshi and @schuemie! I would definitely like to collaborate on this and do anything I can to make public, semi-realistic synthetic observational data a reality. My use case is to test studies before sending them to data partners for execution on all database platforms. Unfortunately, many real studies do not run on SynPUF or Synthea data because the concept distributions are so different. I’ll check out dephi100k and, if you’re OK with it, make a pull request to add it to Eunomia (the HADES example data R package).
This is really exciting work by a team at Columbia. The question is how to guarantee that the data this model generates does not leak any real information about a person. Suppose the model generates a fake patient record that exactly matches a real patient record; that is bound to happen for short sequences of events. There is a concept called differential privacy that could be used here.
Roughly, an algorithm is differentially private if an observer seeing its output cannot tell whether a particular individual’s information was used in the computation. Differential privacy is often discussed in the context of identifying individuals whose information may be in a database. Although it does not directly refer to identification and reidentification attacks, differentially private algorithms provably resist such attacks. -Differential privacy - Wikipedia
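To make that quote concrete, the formal definition: a randomized algorithm M is ε-differentially private if, for any two datasets D and D′ that differ in a single person’s record, and any set of possible outputs S,

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S]
```

In other words, adding or removing any one patient can change the probability of any output (including any particular generated record) by at most a factor of e^ε.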
So I think if you can show that the model would output the same sequences regardless of whether any single individual’s data was included in or excluded from the training process (more precisely, that including or excluding any one person changes the probability of any output by at most a bounded factor), then the algorithm is differentially private.
I have no idea how to actually implement this for LLM training but if it is possible we could have provable privacy guarantees.
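For what it’s worth, the standard recipe for this is DP-SGD: clip each training example’s gradient to a fixed norm and add Gaussian noise before the optimizer step; libraries like Opacus (PyTorch) and TensorFlow Privacy implement it along with the privacy accounting, so it may not need to be built from scratch. A bare-bones sketch of the core idea, not production code, with illustrative hyperparameters:

```python
# Bare-bones DP-SGD step: per-sample gradient clipping + Gaussian noise.
# clip_norm and noise_multiplier are illustrative; a real run would also track the
# privacy budget (epsilon, delta) with a privacy accountant.
import torch

def dp_sgd_step(model, loss_fn, batch_x, batch_y, optimizer,
                clip_norm=1.0, noise_multiplier=1.1):
    summed = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in zip(batch_x, batch_y):                         # per-sample gradients (slow but simple)
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        grads = [p.grad.detach().clone() for p in model.parameters()]
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = torch.clamp(clip_norm / (total_norm + 1e-12), max=1.0)  # clip this sample's gradient
        for acc, g in zip(summed, grads):
            acc.add_(g * scale)
    for p, acc in zip(model.parameters(), summed):
        noise = torch.randn_like(acc) * noise_multiplier * clip_norm    # noise scaled to the clip norm
        p.grad = (acc + noise) / len(batch_x)                  # noisy, averaged gradient
    optimizer.step()
```

The clipping bounds any one patient’s influence on the update, and the noise hides whatever influence remains, which is what gives the provable guarantee.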
For those interested, check out the work of Thomas Strohmer at the University of California, who has been working on the theoretical foundations for provable privacy guarantees. Thomas Strohmer - Publications
I would appreciate it if people pointed out the flaw in my thinking here, but I think privacy is not an issue with this type of approach.
Yes, there is a possibility that, given some information about a patient, the model can exactly generate the remaining data of that patient. However, there is no way of knowing whether the generated data is in fact exactly the same as the input data, or a random sequence that happens to also fit the model.