Where can I find OSIM2 data sets?

francois_meyer · September 21, 2017, 5:50pm

I’ve been searching for good simulated patient-level data sets and it seems that the OSIM2 data sets fit what I’m looking for.

However, OSIM2 is no longer supported and the generated data sets are not available via the FTP servers. A previous forum topic raised the FTP issue and the data sets were made available again, but the servers are down at the moment.

Also, I did download a 1M patient data set when it was up a few months ago, but I am struggling to understand the schema of the data set. Many of the CSV files have no records in them. If the data sets are made available again, is there any way that a supported schema explanation could also be made available?

Christian_Reich · September 21, 2017, 6:44pm

@francois_meyer:

The ftp site is being fixed.

It’s data in CDM version 2.0 or 3.0, and only PERSON, CONDITION, DRUG and OBSERVATION_PERIOD are there. The other domains are not modeled, even though you could easily tweak the code (which is also on that server) to add them in.

What do you need to know?

Christian_Reich · September 21, 2017, 7:57pm

One more thing:

OMOP/OHDSI tools are not “supported” in the sense of a commercial software life cycle anyway. It’s Open Source. The community supports everything, and some things are more actively worked on and others can get stale. If something needs to be fixed or improved you can ask the community if somebody is willing to pick it up. But often the problem just becomes your job. So, in a way you are now “supporting” OSIM2, Francois.

francois_meyer · September 21, 2017, 8:33pm

Thank you Christian. I appreciate the quick replies. I am downloading the data sets at the moment.

I will take a detailed look at the data again. As I understand it the following tables are populated:
persons: contains the simulated patients
condition_era: medical conditions experienced by these patients
drug_era: medical drugs prescribed to these patients
observation_period: observations of patients
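
A quick way to confirm which of the downloaded CSV files actually contain records is to count the data rows in each one. The sketch below is only illustrative: the directory name in the usage note is hypothetical, and whether the OSIM2 files carry a header row is an assumption, so it is exposed as a parameter.

```python
import csv
from pathlib import Path


def count_data_rows(csv_path, has_header=True):
    """Return the number of non-blank data rows in one CSV file."""
    with open(csv_path, newline="") as handle:
        rows = sum(1 for row in csv.reader(handle) if row)
    # Assumption: the first non-blank row is a header unless told otherwise.
    if has_header and rows > 0:
        rows -= 1
    return rows


def summarize_directory(directory, has_header=True):
    """Map each *.csv file name in `directory` to its data-row count."""
    return {
        path.name: count_data_rows(path, has_header)
        for path in sorted(Path(directory).glob("*.csv"))
    }
```

Calling `summarize_directory("osim2_1m")` (a made-up folder name) would return a dictionary in which the unpopulated tables show a count of 0.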

However, I am unsure about the meaning of the variables CONDITION_CONCEPT_ID and DRUG_CONCEPT_ID in the condition_era and drug_era tables respectively. I assume there is some mapping between condition and drug IDs and the condition and drug names, but I can’t seem to find it.

Christian_Reich · September 22, 2017, 3:06pm

@francois_meyer:

Condition_concept_id contains the foreign key to the CONCEPT table referring to the Condition the patient has. So, if the patient has a heart attack the condition_concept_id contains 4329847, which stands for the SNOMED concept for “Myocardial infarction”. Likewise, the drug_concept_id has a foreign key to the CONCEPT table as well, indicating the drug.
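
As a rough illustration of that foreign key, the sketch below resolves `condition_concept_id` values against a CONCEPT lookup. It uses tiny in-memory stand-ins rather than the real files: the person IDs and the concept ID 999 are invented, the column headers on the condition file are an assumption (the OSIM2 CSVs may not have them), and the real CONCEPT table has more columns than shown here.

```python
import csv
import io

# Illustrative stand-in for the tab-delimited CONCEPT table.
CONCEPT_TSV = (
    "concept_id\tconcept_name\n"
    "4329847\tMyocardial infarction\n"
)

# Illustrative stand-in for condition_era; person IDs and 999 are made up.
CONDITION_ERA_CSV = (
    "person_id,condition_concept_id\n"
    "1,4329847\n"
    "2,999\n"
)


def load_concept_names(handle):
    """Build a concept_id -> concept_name dictionary from a CONCEPT file."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {int(row["concept_id"]): row["concept_name"] for row in reader}


def name_conditions(handle, concept_names):
    """Return (person_id, condition name) pairs, following the foreign key."""
    reader = csv.DictReader(handle)
    return [
        (
            int(row["person_id"]),
            concept_names.get(int(row["condition_concept_id"]), "<unknown concept>"),
        )
        for row in reader
    ]


names = load_concept_names(io.StringIO(CONCEPT_TSV))
labelled = name_conditions(io.StringIO(CONDITION_ERA_CSV), names)
print(labelled)
```

The same lookup would apply to `drug_concept_id` in drug_era, since both columns point at the CONCEPT table.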

Why would you expect a link between Conditions and Drugs? Because the Conditions “cause” the prescription of certain drugs? If that’s what you mean: You are right, the data are simulated that way. But it is not stored in the CDM.

Frank · September 22, 2017, 7:19pm

I’m in the process of working on a new data simulator. It is not core to anything OHDSI is doing, more a pet project of mine. Would you mind describing the need you have for a simulated data set? I have a few use cases I’m working to support but hearing more about other potential needs would be appreciated.

mgkahn · September 22, 2017, 10:12pm

Frank – I’ve taken a bit of a sideways approach to getting realistic simulated data into OMOP CDM V5. The Synthea patient data simulator generates patient level data based on disease-specific state transition probability models (https://github.com/synthetichealth/synthea; Walonoski J, Kramer M, Nichols J, Quina A, Moesel C, Hall D, et al. Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record. J Am Med Inform Assoc). The team is constantly adding new disease states, mostly outpatient. They are now adding billing information. The program creates output in various formats – CSV, C-CDA, FHIR STU 3.0.

A CS student is wrapping up a summer project to take the FHIR STU 3.0 output generated by Synthea and create an OMOP CDM V5 Postgres database mapped to the OMOP terminology. Version 1.0 is being wrapped up for posting to the Synthea git site as a separate project (https://github.com/synthetichealth/synthea_omop). We intend to find a place in OHDSI to cross list this tool. I have a “to-do” list of additional work that needs to be done by future students. But we should have an initial version available to the community in the next few weeks when Shahab has his code cleaned up and documented.

I have a very new student just spinning up who will look at substituting the existing Synthea disease-specific transition probabilities with observed probabilities extracted from our health system’s data warehouse. The hope is that this simulated database may look more like our patients than the simulated population based on Massachusetts and national statistics that Synthea is currently using. Don’t know if this idea will pan out.

Patrick_Ryan · September 22, 2017, 11:25pm

That’s great @mgkahn, thanks for sharing this. It sounds like you’ve got a good path forward to a more complete simulated dataset that should serve the OHDSI community well. Together with SYN-PUF, it would seem this dataset should be sufficient to meet the community’s simulated dataset needs. These two simulated datasets would be more appropriate for adoption than OSIM1 or OSIM2, which served a good use many years ago for the OMOP experiments, but at this point are dated, unsupported, and inadequate for testing CDMv5 or the OHDSI toolkit.

jon_duke · September 25, 2017, 3:10pm

@Frank we’ve got a couple of ML students working on synthetic datasets, taking various approaches (e.g., MedGAN). We would be happy to help with your work if you are interested.

In terms of our needs, it would ideally incorporate realistic temporal patterns along with comprehensive disease states and transition models. We have less need for measurements and cost data but perhaps others do.

The holy grail would be to be able to generate predictive models from synthetic data that will translate to real data. (Obviously this is tricky to do without risking re-identification. The folks at MIT had some success.)

Do we have enough critical mass here for a time-limited WG?

francois_meyer · September 27, 2017, 12:25pm

Thank you for explaining.

I find that the condition_concept_id and drug_concept_id fields do not map to the SNOMED concept IDs that I have. I obtained my SNOMED data from https://www.nlm.nih.gov/healthit/snomedct/international.html.

Where can I obtain the SNOMED concept ID to description mapping that was used to map the concept IDs in the OSIM2 data?

Christian_Reich · September 27, 2017, 12:57pm

@francois_meyer:

Oh man. Looks like you need some serious training in the basics. Sign up for free for the CDM Tutorial at the Symposium, you’ll learn it all. Or invite us to come to your organization to teach it (costs a little bit of money). Or look at the (incomplete) documentation.

Bottom line: What SNOMED calls conceptid we call concept_code. Our concept_ids are generated on top of the identifiers that the original vocabularies have. Why? Because those identifiers are not unique across vocabularies. Instead of the SNOMED browser, feel free to use ATLAS to browse the data, or download the vocabularies from ATHENA.
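
To make the distinction concrete, here is a minimal sketch of going from a source code (what SNOMED calls conceptid) to a concept_id using the CONCEPT table. It assumes a tab-delimited file with `concept_id`, `vocabulary_id` and `concept_code` columns; the in-memory sample is illustrative, and the `VOCAB_A` / `VOCAB_B` rows are invented purely to show why a code alone is not a safe key.

```python
import csv
import io

CONCEPT_TSV = (
    "concept_id\tconcept_name\tvocabulary_id\tconcept_code\n"
    "4329847\tMyocardial infarction\tSNOMED\t22298006\n"
    # The two rows below are invented to show a code shared by two vocabularies.
    "9000001\tExample term A\tVOCAB_A\t1001\n"
    "9000002\tExample term B\tVOCAB_B\t1001\n"
)


def build_code_index(handle):
    """Map (vocabulary_id, concept_code) -> concept_id.

    The vocabulary is part of the key because the same code can appear
    in more than one source vocabulary.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        (row["vocabulary_id"], row["concept_code"]): int(row["concept_id"])
        for row in reader
    }


index = build_code_index(io.StringIO(CONCEPT_TSV))
print(index[("SNOMED", "22298006")])
```

With a real vocabulary download, the same function could be given an open file handle for the CONCEPT file instead of the `io.StringIO` sample.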