Has anyone used OSIM data sets?

francois_meyer · September 20, 2017, 1:47pm

@Christian_Reich Hi. It seems that the FTP servers are down at the moment. Will they be back up at any time? Alternatively, is there anywhere else to find these simulated data sets?

Christian_Reich · September 21, 2017, 6:45pm

@francois_meyer:

Working on it.

dogatekin · October 8, 2017, 2:45pm

@Christian_Reich:

Hello Christian,

I downloaded the OSIM 2 data from the FTP server you kindly provided. My aim is to use the dataset in a multi-armed bandit setting, for which I could really use the true relationships between the conditions and drugs. As far as I can tell, these relationships were provided in the OMOP Cup to the competitors along with the data.

Is there any way for me to access this information? It would be great if I could somehow download the same package you provided for the competitors in the OMOP Cup. Thanks in advance!

Christian_Reich · October 8, 2017, 3:06pm

@dogatekin:

Yeah. It exists. If I only knew where. There is a probability matrix for each condition and each drug, based on the true relationships found in the data. I can ask around. Which OSIM file are you using?

But if you really want to go that deep into this, it may be better to build one from scratch; the code is available. As explained in this Forum debate, nobody is really working on OSIM2 anymore, but there are other initiatives.

dogatekin · October 8, 2017, 5:54pm

Thanks for your very quick response.

Right now, I am playing around with OSIM2_1M_MSLR_SNOMED_0, but I also downloaded the 50M files with and without the risk signals infused. I did find a CSV file called “signal_ref” that contains the relative risks connecting drugs and outcomes, is that the file you were talking about by any chance? Or is there a complete probability matrix elsewhere?

I was hoping I would not have to build one from scratch, but I will of course look into it if I have to. If I do end up building one, would I need a complete observational database to base the simulated persons on?

Christian_Reich · October 9, 2017, 5:58pm

@dogatekin:

Then the files should be OSIM2_1M_MSLR_SNOMED_0 and OSIM2_1M_MSLR_SNOMED_1.

The file signal_ref contains the signals infused after simulation. Those signals are drugs to conditions (side effects), not the other way around.

The probabilities between drugs and drugs, and between drugs and conditions, are in the Transition Matrix. The file name should be “OSIM2 MSLR SNOMED Transition Matrix”, where “MSLR” is the name of the database.

dogatekin · October 9, 2017, 10:20pm

@Christian_Reich:

I actually only have access to the one without the risks infused (0), since that is the only one on the ftp server: ftp://ftp.ohdsi.org/osim2/ Are there more data I can download somewhere else?

That is actually what I am looking for, sorry if I confused you with my previous message. It was my understanding that the true relationships provided in the OMOP Cup were about the relationships from drugs to conditions.

Which brings us to my final question, where are these transition matrices? I finally decided to build a dataset from scratch using the OSIM 2 code, but I either need a real observational database or (more conveniently) the transition matrices. I unfortunately cannot find any in the ftp server, am I missing something?

Sorry to keep nagging you about an “unsupported” data simulation tool, but I could really use a little help!

Christian_Reich · October 10, 2017, 12:05am

@dogatekin:

Yeah, you are a little out of luck. Maybe @richm or @mkhayter still have old versions. The data files used to be stored on AWS and went where all living things eventually go - when that server was switched off due to lack of funding. What’s on the FTP site are a few remnants, and the code to make yourself a new one. If you want to. The question is, why would you want that.

The way it worked was this: The simulator would simulate conditions, as a function of existing conditions (or lack thereof). And it would simulate drugs based on conditions. That is a world where no drug causes any condition, which means drugs have no side effects. These are the _0 files. The _1 files resulted from adding those side effects based on a separate risk factor provided. The OMOP cup would then have various competing methods try to figure out what that risk was. Nice idea, except it forgets that there are all these relationships between everything called confounding. So the best method in the cup wasn’t necessarily the best method in finding the risks in the real world.
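To make the three steps above concrete, here is a toy sketch in Python. It is not the OSIM2 code: every condition, drug, and probability below is made up for illustration, and the side effect is infused as a simple added probability rather than through a relative risk.

```python
import random

# Hypothetical toy probabilities; none of these come from OSIM2.
# None as a key means "no prior condition".
CONDITION_GIVEN_CONDITION = {
    None: {"hypertension": 0.3},
    "hypertension": {"heart_failure": 0.2},
}
DRUG_GIVEN_CONDITION = {"hypertension": {"beta_blocker": 0.6}}
SIDE_EFFECT = {"beta_blocker": ("fatigue", 0.1)}  # drug -> (condition, probability)


def simulate_person(rng, infuse_signals):
    """Return (conditions, drugs) for one simulated person."""
    conditions, drugs = [], []
    # Step 1: conditions depend on existing conditions (or the lack thereof).
    frontier = [None]
    while frontier:
        prior = frontier.pop()
        for cond, p in CONDITION_GIVEN_CONDITION.get(prior, {}).items():
            if cond not in conditions and rng.random() < p:
                conditions.append(cond)
                frontier.append(cond)
    # Step 2: drugs depend on conditions only, so no drug causes anything yet.
    for cond in conditions:
        for drug, p in DRUG_GIVEN_CONDITION.get(cond, {}).items():
            if drug not in drugs and rng.random() < p:
                drugs.append(drug)
    # Step 3 (the "_1" variant): infuse side effects after the simulation.
    if infuse_signals:
        for drug in drugs:
            if drug in SIDE_EFFECT:
                effect, p = SIDE_EFFECT[drug]
                if effect not in conditions and rng.random() < p:
                    conditions.append(effect)
    return conditions, drugs


rng = random.Random(0)
print(simulate_person(rng, infuse_signals=True))
```

With `infuse_signals=False` the side-effect condition can never appear, which mirrors the _0 files; with `infuse_signals=True` it appears only among drug takers, as in the _1 files.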

Makes sense? What is your use case? What are you trying to do?

If you really want to recreate OSIM2, you will have to have somebody run the script that builds the transition matrix. So, you have to go to a data owner or subscriber. Happy to help you out if you can tell me more about what it is you want to do.

dogatekin · October 22, 2017, 5:25pm

@Christian_Reich:

Hello again! Sorry for the late response, I had to talk with my professor to decide on some details of the project.

What we are trying to do is to use the OSIM2 data for online learning using several multi-armed bandit algorithms. This unfortunately requires case-by-case data, where each data sample has a context (age and gender), condition, drug and an outcome. For OSIM2 data, a row of the desired dataset would look like this:

PERSON_ID, CONDITION_CONCEPT_ID, DRUG_CONCEPT_ID, OUTCOME

The drug in DRUG_CONCEPT_ID is the drug that was given for the condition in CONDITION_CONCEPT_ID of the person identified by PERSON_ID, and OUTCOME is whether a side effect or health benefit (or possibly nothing) occurred as a result.
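As a rough sketch, one such row could be represented in Python like this. The class name, the `Outcome` encoding, and the `age` and `gender` fields joined in from the person record are assumptions for illustration; only the four column names come from the row format above.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    """Hypothetical encoding of the OUTCOME column."""
    NOTHING = 0
    SIDE_EFFECT = 1
    HEALTH_BENEFIT = 2


@dataclass(frozen=True)
class BanditSample:
    person_id: int
    age: int                   # context, assumed to be joined in per person
    gender: str                # context, assumed to be joined in per person
    condition_concept_id: int
    drug_concept_id: int       # the "arm" that was chosen
    outcome: Outcome

    def context(self):
        """Context vector seen by the bandit before it picks a drug."""
        return (self.age, self.gender)


# Made-up concept IDs, purely for illustration.
sample = BanditSample(1, 54, "F", 1001, 2002, Outcome.SIDE_EFFECT)
print(sample.context(), sample.drug_concept_id, sample.outcome.name)
```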

I am well aware that OSIM2 might not have been designed with such a use case in mind. It seems it would be quite difficult to generate new data in this manner, especially considering how the risk signals are normally infused after the data is generated. Nonetheless, I found the OSIM2 transition probability matrices from someone else in my research group and I am currently trying to generate new data in the way my project requires. If you have any recommendations about how I could progress or if you know of a more suitable dataset I could use, I would be happy to listen!

By the way, let me know if you would like to upload the transition probability matrices to the FTP server for possible future use, I can send them if it would be helpful for anyone.

Christian_Reich · October 22, 2017, 7:53pm

@dogatekin:

Ok, makes sense. Well, your algorithm has to realize that there is no such thing as a linear condition-drug-outcome chain of events. There are probabilities between them. In real life, we don’t know what they are. In a simulated life we do, but we don’t know how realistic they are. Think about it: The simplistic logic is that a drug gets prescribed to treat a certain condition. In reality this is often the case (high blood pressure causes the prescription of an anti-hypertensive drug), but in other cases it is not (beta blockers also help you calm down, which might be an intended effect, or not). But we rarely capture any of this. The doctor just prescribes “what’s good for you” and doesn’t bother explaining too much. That’s for the first half of the triple. The second half, the relationship between treatment and outcome, is even murkier. We have these self-healing bodies, and it is very hard to say for sure what it was that made the patient get better. In some cases, we know for sure it wasn’t the drug (like antibiotics prescribed for common colds, which is a viral disease in most cases). As a result, we rarely try to resolve this on an individual patient basis, but look for causal effects in the entire population of patients that share some features we are studying. It’s called population-based estimation, and there is a whole part of OHDSI building methods and tools to help with that.

So much for the theory. In your particular case, you can use OSIM2 to play with it. Two things:

1. There is no difference between Condition and Outcome. Outcome just means you get more or fewer Conditions (the rate changes), or a higher or lower degree of a Condition (the quantity of something changes), or faster or slower time to onset or end of a Condition. So, your triple is Condition-Drug-Condition.
2. OSIM2 lets you model both halves. But you need the transition matrix in order to create realistic first halves. The transition matrix of the data on the FTP site is gone and cannot be reproduced (it’s based on a database of a few years ago, nobody has that anymore).

So, you need to recreate the whole thing. For that, you need access to data. You either have data, or you need to get them from organizations that sell them, like QuintilesIMS (where I work), Truven, or Optum. These places usually want money for that, but can be convinced to give you access to the data for free in exchange for some useful artifact, insight or paper. One artifact could be that you will provide the simulated data back to them, especially if they contained more than just Condition and Drug.

Does that help?

mgkahn · October 22, 2017, 8:23pm

I urge @dogatekin to look for my earlier posting about the Synthea patient simulator, which encompasses the probabilistic, state-transition features that @Christian_Reich discusses above. We are just a few weeks away from posting a Synthea to OMOP CDM V5.1 tool to the community (the problem with using students to do this work is that things like class and PhD defenses get in the way…) To look at Synthea, go to https://github.com/synthetichealth/synthea

dogatekin · November 19, 2017, 12:02pm

@Christian_Reich:

Thanks very much for both the theoretical and practical explanations, you have been incredibly helpful!

One final question: the outcomes are infused through signals which have “Relative Risk” values. After counting the drug-outcome pairs in the risk-free data, these risk values are used to determine how many conditions to add into the database for certain patients that took risky drugs.

Do you know if there is any logical way of converting these Relative Risk values into probabilities? It would be very useful if I was able to say something along the lines of “this drug will cause this new condition to appear with probability 0.05” but I am not sure if that is possible with the OSIM2 data.

@mgkahn:

Thank you for the kind suggestion. My professor wants me to obtain some results using OSIM2 before trying other datasets or simulators but if I have time afterwards, I will make sure to check out Synthea. Good luck with your Synthea to OMOP CDM tool!

Christian_Reich · November 19, 2017, 1:38pm

Well, baseline probability * relative risk = resulting overall probability. So, if your baseline probability without taking the drug is, according to the transition matrix, say, 0.01, and the relative risk is 5, then the probability of your drug takers getting the outcome is 0.05. Not a lot of complexity here.
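The arithmetic above, as a minimal Python sketch. The function name is made up, and the cap at 1.0 is an added safeguard for large relative risks rather than anything taken from OSIM2.

```python
def probability_with_drug(baseline_probability: float, relative_risk: float) -> float:
    """Outcome probability among drug takers: baseline * relative risk, capped at 1."""
    if not 0.0 <= baseline_probability <= 1.0:
        raise ValueError("baseline_probability must be in [0, 1]")
    if relative_risk < 0.0:
        raise ValueError("relative_risk must be non-negative")
    return min(1.0, baseline_probability * relative_risk)


# Baseline 0.01 without the drug, relative risk 5.
print(round(probability_with_drug(0.01, 5.0), 4))
```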

dogatekin · November 19, 2017, 3:46pm

Indeed there would not have been a lot of complexity here, if only there was such a “baseline probability of outcomes without drugs” transition matrix. In the OSIM2 package, the baseline probabilities (or current prevalence as it is called in the package) are calculated on a case-by-case basis by counting the patients with each outcome after the conditions and outcomes are generated separately. Hence it changes at each run.
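A rough Python sketch of that counting step, under stated assumptions: this is not the OSIM2 code, the person layout and concept IDs are invented, and the prevalence is counted among drug takers in one risk-free run (where the drug has no effect yet). It shows why the number changes at each run: it is derived from whatever data that run happened to generate.

```python
def count_exposed_cases(persons, drug_id, outcome_id):
    """Return (cases, exposed): drug takers with the outcome, and all drug takers."""
    exposed = [p for p in persons if drug_id in p["drugs"]]
    cases = sum(1 for p in exposed if outcome_id in p["conditions"])
    return cases, len(exposed)


def outcomes_to_add(persons, drug_id, outcome_id, relative_risk):
    """How many extra outcomes to infuse so that prevalence scales by relative_risk."""
    cases, exposed = count_exposed_cases(persons, drug_id, outcome_id)
    target_cases = min(exposed, round(cases * relative_risk))
    return max(0, target_cases - cases)


# Hypothetical run: 10 takers of drug 111 (2 with outcome 222) and 5 non-takers.
persons = (
    [{"drugs": {111}, "conditions": {222}} for _ in range(2)]
    + [{"drugs": {111}, "conditions": set()} for _ in range(8)]
    + [{"drugs": set(), "conditions": {222}} for _ in range(5)]
)
cases, exposed = count_exposed_cases(persons, 111, 222)
print(cases / exposed, outcomes_to_add(persons, 111, 222, relative_risk=2.0))
```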

What I was wondering was if there was a table that had the risk probabilities of these drugs which was perhaps used to calculate these relative risk values.

kausarm · November 19, 2017, 9:21pm

It might be a little late for this, but as part of my research I’m trying to update the OSIM code to generate synthetic data (including the analysis phase that generates the transition probability tables) based off OMOP CDM V5. While I’m still a couple of weeks away from a working prototype, if you still require transition matrices for your research I’d be happy to send them across once I have them.

Christian_Reich · November 20, 2017, 11:39am

@kausarm:

That’s great! Are you going to post it back to the community?

I see your problem. Well, there is the transition matrix probability between all the conditions and all the drugs. You might be able to calculate an aggregate probability, taking the probability of having the condition into account. Sounds like you have a good chunky project, there. Kausar might be able to help you by giving you code you can run and build small test databases to play with…
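One way to read "aggregate probability" is the law of total probability: weight each condition-to-target transition probability by the probability of having that condition, then sum. The sketch below assumes the conditions are mutually exclusive states whose probabilities sum to 1, which real patients do not satisfy, and the dictionary layout is hypothetical rather than the actual format of the OSIM2 transition matrix.

```python
def aggregate_probability(condition_prob, transition_prob, target):
    """P(target) = sum over conditions c of P(c) * P(target | c)."""
    return sum(
        p_condition * transition_prob.get((condition, target), 0.0)
        for condition, p_condition in condition_prob.items()
    )


# Made-up numbers for illustration only.
condition_prob = {"condition_a": 0.25, "condition_b": 0.75}
transition_prob = {
    ("condition_a", "drug_x"): 0.4,
    ("condition_b", "drug_x"): 0.0,
}
print(round(aggregate_probability(condition_prob, transition_prob, "drug_x"), 4))
```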

kausarm · November 20, 2017, 5:47pm

Yes, the current plan is to post the code back to the community once I’m done. The only potential difference from the OSIM 2 simulator would be that the code repository would be in Postgres and not SQL Server.

Christian_Reich · November 20, 2017, 7:27pm

@kausarm: Perfect. Postgres is totally fine. OHDSI is not married to one SQL flavor.

kausarm · November 20, 2017, 9:27pm

@Christian_Reich That’s great! Also, do you know where I can get documentation for OMOP CDM V2? It would really help move the conversion process along. I was only able to find documentation for V4 and V5 on the official website so far.

Thanks!

Radu · February 7, 2018, 10:16am

Hi Christian, I am jumping on this thread because I have a related question, something that I could not grasp despite all my research on the OSIM2 project history: Does the OSIM2 folder on the OHDSI FTP site contain the raw (original) MSLR data together with the OSIM2-derived data, or only OSIM2-derived data? Is the following assumption correct?

OSIM2_1M_MSLR_SNOMED_0_CSV.tar.gz 466 MB //-------this is the original MSLR data, as received from Truven (actually it has around 700K persons not 1 M)

OSIM2_50M_MDCD_SNOMED_0.tar.gz 18.3 GB 6/28/16, //--------OSIM 2 derived set with 50 million synthetic records
osim2_50M_mdcd_snomed_1.tar.gz 8.9 GB //------similar to above but only the drug_era file

A second question would be: Is this data linkable through person_id to the ohdsi_in_a_box dataset (the data inside the postgres database on the ohdsi vbox machine), or is it totally different?

Thanks a lot

Radu Bengulescu MD
Healthcare Informatics PhD