How to handle multiple diagnosis/drug_era/etc for one patient id in patient level prediction

Hi. I’m trying to predict death (patient-level prediction) in Python with CDM data.
The problem is that to do this in Python, I need to build a dataframe of patient data somehow.
However, as you know, for a given attribute (column), say diagnosis, a single patient ID can have multiple diagnoses on different dates.
Since I need to predict the death of a person, I need one row per person, so how can I deal with multiple diagnoses?
There are two ways I can think of. The first is to make multiple attributes, say diagnosis 1, diagnosis 2, and so on, and put the diagnoses in separate columns. But if I do that, there would be a lot of missing data, because some patients have many diagnoses and others don’t.
The second is to use each diagnosis value as a variable: for example, make a lung cancer column and set it to 1 if a person is diagnosed with lung cancer and 0 otherwise. This would produce a sparse matrix, which I think would give poor predictions.
So, how can I deal with this problem?
Thank you in advance for your help.


That’s the nature of health data. Most people have no diagnosis, some have one, a few have a few, and even fewer have many. How many illnesses can you have? 10? 100? It’s called a sparse matrix. Most cells are 0 (not missing).
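For anyone staying in Python, here is a minimal sketch of the "one row per person, one 0/1 column per diagnosis concept" layout described above, stored in coordinate form so the zeros never have to be materialized. It assumes the rows were already pulled from CONDITION_OCCURRENCE as (person_id, condition_concept_id, condition_start_date) tuples; the concept IDs, dates, the `index_date` cutoff and the function name `build_binary_features` are all made up for illustration and are not part of any OHDSI package.

```python
from collections import defaultdict

# Toy long-format rows (person_id, condition_concept_id, condition_start_date).
# All values are invented for illustration.
condition_rows = [
    (1, 1001, "2019-01-05"),
    (1, 1002, "2019-03-10"),
    (1, 1001, "2020-02-01"),  # repeat of the same concept for person 1
    (2, 1002, "2018-07-21"),
    (3, 1003, "2021-11-30"),  # falls after the index date, so it is ignored
]
all_person_ids = [1, 2, 3, 4]  # person 4 has no diagnoses at all
index_date = "2021-01-01"      # assumption: only use records before this date


def build_binary_features(rows, person_ids, index_date):
    """Return (row_idx, col_idx, concept_ids) describing the nonzero cells."""
    person_pos = {pid: i for i, pid in enumerate(person_ids)}
    seen = defaultdict(set)
    for pid, concept_id, start_date in rows:
        # ISO-formatted dates compare correctly as plain strings.
        if start_date < index_date:
            seen[pid].add(concept_id)  # a set collapses repeat diagnoses

    concept_ids = sorted({c for concepts in seen.values() for c in concepts})
    concept_pos = {c: j for j, c in enumerate(concept_ids)}

    row_idx, col_idx = [], []
    for pid in person_ids:
        for c in sorted(seen.get(pid, ())):
            row_idx.append(person_pos[pid])
            col_idx.append(concept_pos[c])
    return row_idx, col_idx, concept_ids


row_idx, col_idx, concept_ids = build_binary_features(
    condition_rows, all_person_ids, index_date
)
```

Every person keeps a row even with no diagnoses (their row is simply all zeros), which is the "0, not missing" point. If SciPy is available, the two index lists can be handed to `scipy.sparse.csr_matrix(([1] * len(row_idx), (row_idx, col_idx)), shape=(len(all_person_ids), len(concept_ids)))` to get a matrix most learners accept directly.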

You may want to save yourself the effort. There is a Patient Level Prediction package taking care of all those issues, and many others. It’s not in Python though.

Good luck.


Thanks for clarifying the issue. I may use R if I can, with the help of Atlas for defining the cohort, but for now I’m sticking with Python until I figure out how to define a cohort with person IDs only. If you have time, feel free to look at my other question, Defining cohort with person id - #2 by Chris_Knoll, which I’m struggling with. I’m not using the Atlas cohort definition, but rather building a cohort through the SQL import-export tab. I think this is an interesting question, and it could be helpful for other people, since people could define a cohort themselves and upload it to Atlas using the import-export tab. Thank you Christian.

Hm. Not sure. @Chris_Knoll said in that other Forum post how to do it: Write the COHORT table by hand (SQL, or from Python). Pick a cohort_id that isn’t already used, put the person_id into the subject_id field and write out cohort_start_date and cohort_end_date as you wish. Then you are ready to go.
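A rough sketch of that recipe from Python, using an in-memory SQLite database as a stand-in for the real results schema. The column names follow the CDM COHORT table, where the ID column is called cohort_definition_id; the person IDs, dates and the helper `next_free_cohort_id` are hypothetical, and a real database would need its own connection and date types.

```python
import sqlite3

# In-memory SQLite stands in for the real database (assumption for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE cohort (
           cohort_definition_id INTEGER NOT NULL,
           subject_id           INTEGER NOT NULL,
           cohort_start_date    TEXT    NOT NULL,
           cohort_end_date      TEXT    NOT NULL)"""
)
# A pre-existing cohort, so that ID 1 is already taken.
conn.execute("INSERT INTO cohort VALUES (1, 999, '2015-01-01', '2015-12-31')")

# Hypothetical list of person_ids chosen outside Atlas, with chosen dates.
person_ids = [101, 102, 103]
start_date, end_date = "2020-01-01", "2020-12-31"


def next_free_cohort_id(connection):
    """Pick an ID that is not used yet: one above the current maximum."""
    (max_id,) = connection.execute(
        "SELECT COALESCE(MAX(cohort_definition_id), 0) FROM cohort"
    ).fetchone()
    return max_id + 1


new_id = next_free_cohort_id(conn)
conn.executemany(
    "INSERT INTO cohort VALUES (?, ?, ?, ?)",
    [(new_id, pid, start_date, end_date) for pid in person_ids],
)
conn.commit()

# Read the new cohort back: person_id went into subject_id.
written = conn.execute(
    "SELECT subject_id FROM cohort WHERE cohort_definition_id = ? "
    "ORDER BY subject_id",
    (new_id,),
).fetchall()
```

Taking one above the current maximum is just one simple way to avoid colliding with an ID that is already in use.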

I am not sure how Atlas will recognize a cohort that it didn’t write itself. @Chris_Knoll might know.

It won’t. But you can get to most of the Atlas functions via R using HADES packages. I think I understand the main idea: define a cohort outside of Atlas, then use it inside Atlas for Atlas functions (the only two would be characterization and incidence rate). But sourcing external cohorts isn’t a supported operation. There should be many examples in the community on how to characterize a cohort or calculate an incidence rate using HADES.

Oh! I think I had a brainwave here:

Can you assign the set of people of interest some standard concept (or it could be a source concept) that would only be assigned to those people in the population? Say, put it in the Observation table and give each Observation record the specific date you want those people to have. Then you can use cohort definitions to find people by observation and use the observation date as the cohort start and end.
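Sketched in Python, that could look roughly like the following. It again uses in-memory SQLite and only a handful of OBSERVATION columns, so a real CDM table will need more fields filled in. `MARKER_CONCEPT_ID` is a hypothetical local concept (an ID above 2,000,000,000, the range conventionally left for site-specific concepts), the type concept is left as a 0 placeholder, and the people and dates are invented. Presumably the marker concept would also need to exist in the vocabulary before a cohort definition could select on it; that part is an assumption and not shown here.

```python
import sqlite3

MARKER_CONCEPT_ID = 2_000_000_001  # hypothetical local concept, not a real one

# In-memory SQLite with a cut-down OBSERVATION table (assumption for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE observation (
           observation_id              INTEGER PRIMARY KEY,
           person_id                   INTEGER NOT NULL,
           observation_concept_id      INTEGER NOT NULL,
           observation_date            TEXT    NOT NULL,
           observation_type_concept_id INTEGER NOT NULL)"""
)
# An unrelated, pre-existing observation record.
conn.execute("INSERT INTO observation VALUES (1, 500, 4000000, '2017-06-01', 0)")

# Hypothetical people of interest and the date each should enter the cohort.
people = {101: "2020-01-15", 102: "2020-02-20"}

# Continue numbering after the highest observation_id already present.
(max_id,) = conn.execute(
    "SELECT COALESCE(MAX(observation_id), 0) FROM observation"
).fetchone()
rows = [
    (max_id + offset, pid, MARKER_CONCEPT_ID, date, 0)
    for offset, (pid, date) in enumerate(sorted(people.items()), start=1)
]
conn.executemany("INSERT INTO observation VALUES (?, ?, ?, ?, ?)", rows)
conn.commit()

# What a cohort definition keyed on the marker concept would be selecting.
marked = conn.execute(
    "SELECT person_id, observation_date FROM observation "
    "WHERE observation_concept_id = ? ORDER BY person_id",
    (MARKER_CONCEPT_ID,),
).fetchall()
```

Because only the people of interest carry the marker concept, selecting on it returns exactly that set, with the observation date available as the cohort start and end.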

Brilliant! You. are. awesome. Thank you for the great idea.
Though I hope Atlas gets updated to handle this type of concern at some point.
Thanks Chris.