
Has anyone used OSIM data sets?

@Christian_Reich:

Hello again! Sorry for the late response, I had to talk with my professor to decide on some details of the project.

What we are trying to do is to use the OSIM2 data for online learning using several multi-armed bandit algorithms. This unfortunately requires case-by-case data, where each data sample has a context (age and gender), condition, drug and an outcome. For OSIM2 data, a row of the desired dataset would look like this:

PERSON_ID, CONDITION_CONCEPT_ID, DRUG_CONCEPT_ID, OUTCOME

The drug in DRUG_CONCEPT_ID is the drug that was given for the condition in CONDITION_CONCEPT_ID of the person identified by PERSON_ID, and OUTCOME is whether a side effect or health benefit (or possibly nothing) occurred as a result.
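A minimal sketch of one such sample as a Python record, assuming age and gender are joined in from the person data as the bandit context. The class name `BanditSample`, the field types, the integer coding of `outcome`, and the concept IDs in the usage example are all illustrative assumptions, not part of OSIM2 or the OMOP CDM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BanditSample:
    """One case-by-case sample for a contextual bandit (hypothetical layout)."""
    person_id: int
    age: int                    # context
    gender: str                 # context
    condition_concept_id: int   # condition being treated
    drug_concept_id: int        # the arm that was pulled
    outcome: int                # assumed coding: 1 benefit, 0 nothing, -1 side effect


def to_row(sample: BanditSample) -> tuple:
    """Project a sample onto the four-column row described above."""
    return (
        sample.person_id,
        sample.condition_concept_id,
        sample.drug_concept_id,
        sample.outcome,
    )


# Placeholder IDs, not real concept IDs.
example = BanditSample(person_id=1, age=54, gender="F",
                       condition_concept_id=111, drug_concept_id=222, outcome=0)
print(to_row(example))
```

Keeping the context in the record but out of the row keeps the row identical to the four columns listed above.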

I am well aware that OSIM2 might not have been designed with such a use case in mind. It seems it would be quite difficult to generate new data in this manner, especially considering how the risk signals are normally infused after the data is generated. Nonetheless, I found the OSIM2 transition probability matrices from someone else in my research group and I am currently trying to generate new data in the way my project requires. If you have any recommendations about how I could progress or if you know of a more suitable dataset I could use, I would be happy to listen!

By the way, let me know if you would like to upload the transition probability matrices to the FTP server for possible future use. I can send them if it would be helpful for anyone.

@dogatekin:

Ok, makes sense. Well, your algorithm has to realize that there is no such thing as a linear condition-drug-outcome chain of events. There are probabilities between them. In real life, we don’t know what they are. In a simulated life we do, but we don’t know how realistic they are. Think about it: The simplistic logic is that a drug gets prescribed to treat a certain condition. In reality this is often the case (high blood pressure causes the prescription of an anti-hypertensive drug), but in other cases it is not (beta blockers also help you calm down, which might be an intended effect, or not). But we rarely capture any of this. The doctor just prescribes “what’s good for you” and doesn’t bother explaining too much. That’s the first half of the triple.

The second half, the relationship between treatment and outcome, is even murkier. We have these self-healing bodies, and it is very hard to say for sure what it was that made the patient get better. In some cases, we know for sure it wasn’t the drug (like antibiotics prescribed for the common cold, which is a viral disease in most cases). As a result, we rarely try to resolve this on an individual patient basis, but look for causal effects in the entire population of patients that share some features we are studying. It’s called population-based estimation, and there is a whole part of OHDSI building methods and tools to help with that.

So much for the theory. In your particular case, you can use OSIM2 to play with it. Two things:

  1. There is no difference between Condition and Outcome. Outcome just means you get more or fewer Conditions (the rate changes), or a higher or lower degree of a Condition (the quantity of something changes), or faster or slower time to onset or end of a Condition. So, your triple is Condition-Drug-Condition.
  2. OSIM2 lets you model both halves. But you need the transition matrix in order to create realistic first halves. The transition matrix of the data on the FTP site is gone and cannot be reproduced (it’s based on a database of a few years ago, nobody has that anymore).

So, you need to recreate the whole thing. For that, you need access to data. You either have data, or you need to get them from organizations that sell them, like QuintilesIMS (where I work), Truven, Optum or so. These places usually want money for that, but can be convinced to give you access to the data for free in exchange for some useful artifact, insight or paper. One artifact could be that you provide the simulated data back to them, especially if they contain more than just Condition and Drug.

Does that help?

I urge @dogatekin to look for my earlier posting about the Synthea patient simulator, which encompasses the probabilistic, state-transition features that @Christian_Reich discusses above. We are just a few weeks away from posting a Synthea to OMOP CDM V5.1 tool to the community (the problem with using students to do this work is that things like class and PhD defenses get in the way…) To look at Synthea, go to https://github.com/synthetichealth/synthea

@Christian_Reich:

Thanks very much for both the theoretical and practical explanations, you have been incredibly helpful!

One final question: the outcomes are infused through signals which have “Relative Risk” values. After counting the drug-outcome pairs in the risk-free data, these risk values are used to determine how many conditions to add into the database for certain patients that took risky drugs.

Do you know if there is any logical way of converting these Relative Risk values into probabilities? It would be very useful if I was able to say something along the lines of “this drug will cause this new condition to appear with probability 0.05” but I am not sure if that is possible with the OSIM2 data.

@mgkahn:

Thank you for the kind suggestion. My professor wants me to obtain some results using OSIM2 before trying other datasets or simulators but if I have time afterwards, I will make sure to check out Synthea. Good luck with your Synthea to OMOP CDM tool!

Well, baseline probability * relative risk = resulting overall probability. So, if your baseline probability without taking the drug is, according to the transition matrix, say, 0.01, and the relative risk is 5, then the probability of your drug takers getting the outcome is 0.05. Not a lot of complexity, here.
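The same arithmetic as a small Python sketch. The function name `risk_with_drug` is made up for illustration, and capping the result at 1.0 is an added assumption, since a large relative risk applied to a high baseline would otherwise leave the valid probability range.

```python
def risk_with_drug(baseline_probability: float, relative_risk: float) -> float:
    """Probability of the outcome among drug takers: baseline * relative risk."""
    if not 0.0 <= baseline_probability <= 1.0:
        raise ValueError("baseline_probability must be between 0 and 1")
    if relative_risk < 0.0:
        raise ValueError("relative_risk must be non-negative")
    # Assumption: cap at 1.0 so the result stays a valid probability.
    return min(baseline_probability * relative_risk, 1.0)


# Baseline 0.01 without the drug, relative risk 5.
print(risk_with_drug(0.01, 5))
```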

Indeed there would not have been a lot of complexity here, if only there was such a “baseline probability of outcomes without drugs” transition matrix. In the OSIM2 package, the baseline probabilities (or current prevalence as it is called in the package) are calculated on a case-by-case basis by counting the patients with each outcome after the conditions and outcomes are generated separately. Hence it changes at each run.
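To illustrate what counting prevalence per run means, here is a rough Python sketch. It is not the OSIM2 code; the function name `outcome_prevalence` and the input layout (a list of person IDs plus `(person_id, outcome_id)` pairs) are assumptions made for the example.

```python
from collections import defaultdict


def outcome_prevalence(person_ids, outcome_rows):
    """Fraction of simulated persons who have each outcome at least once."""
    persons = set(person_ids)
    if not persons:
        raise ValueError("need at least one person")
    persons_with_outcome = defaultdict(set)
    for person_id, outcome_id in outcome_rows:
        if person_id in persons:
            # A set ensures each person counts once per outcome.
            persons_with_outcome[outcome_id].add(person_id)
    return {
        outcome_id: len(ids) / len(persons)
        for outcome_id, ids in persons_with_outcome.items()
    }


# Four persons; outcome "A" occurs in two of them, "B" in one.
rows = [(1, "A"), (1, "A"), (2, "A"), (3, "B")]
print(outcome_prevalence([1, 2, 3, 4], rows))
```

Because the counts come from whatever patients a given run happened to generate, a different run gives different values, which is why no fixed baseline table exists.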

What I was wondering was if there was a table that had the risk probabilities of these drugs which was perhaps used to calculate these relative risk values.

It might be a little late for this, but as part of my research I’m trying to update the OSIM code to generate synthetic data (including the analysis phase that generates the transition probability tables) based off OMOP CDM V5. While I’m still a couple of weeks away from a working prototype, if you still require transition matrices for your research I’d be happy to send them across once I have them.

@kausarm:

That’s great! Are you going to post it back to the community?

I see your problem. Well, there is the transition matrix probability between all the conditions and all the drugs. You might be able to calculate an aggregate probability, taking the probability of having the condition into account. Sounds like you have a good chunky project, there. Kausar might be able to help you by giving you code you can run and build small test databases to play with…
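One possible reading of "aggregate probability" is a weighted sum: weight each conditional transition probability by the probability of having the condition in the first place. The Python sketch below assumes that reading; the function name and the dictionary layout are hypothetical, and the numbers are invented for the example.

```python
def aggregate_probability(condition_probability, transition_probability):
    """Sum over conditions c of P(c) * P(target | c)."""
    total = 0.0
    for condition, p_condition in condition_probability.items():
        # Conditions with no entry in the transition table contribute nothing.
        total += p_condition * transition_probability.get(condition, 0.0)
    return total


# Invented numbers: two conditions and their transition probabilities to one target.
p_condition = {"cond_a": 0.2, "cond_b": 0.8}
p_target_given_condition = {"cond_a": 0.5, "cond_b": 0.1}
print(aggregate_probability(p_condition, p_target_given_condition))
```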

Yes, the current plan is to post the code back to the community once I’m done. The only potential difference from the OSIM 2 simulator would be that the code repository would be in Postgres and not SQL Server.

@kausarm: Perfect. Postgres is totally fine. OHDSI is not married to one SQL flavor.

@Christian_Reich That’s great! Also, do you know where I can get documentation for OMOP CDM V2? It would really help move the conversion process along. I was only able to find documentation for V4 and V5 on the official website so far.

Thanks!

Hi Christian, I am jumping on this thread because I have a related question, something that, despite all my research on the OSIM2 project history, I could not grasp: does the OSIM2 folder on the OHDSI FTP site contain the raw (original) MSLR data together with the OSIM2-derived data, or only OSIM2-derived data? Is the following assumption correct?

OSIM2_1M_MSLR_SNOMED_0_CSV.tar.gz 466 MB //-------this is the original MSLR data, as received from Truven (actually it has around 700K persons not 1 M)

OSIM2_50M_MDCD_SNOMED_0.tar.gz 18.3 GB 6/28/16, //--------OSIM 2 derived set with 50 million synthetic records
osim2_50M_mdcd_snomed_1.tar.gz 8.9 GB //------similar to above but only the drug_era file

A second question would be: Is this data linkable through person_id to the ohdsi_in_a_box dataset (the data inside the postgres database on the ohdsi vbox machine), or is it totally different?

Thanks a lot

Radu Bengulescu MD
Healthcare Informatics PhD
