How to select one event from multiple events of a subject for analysis?

Akshay · February 5, 2020, 7:28am

Hello Everyone,

I have a dataset on which I am working on a binary classification problem. There are two classes of subjects: outpatients and inpatients (the class proportion is 66:33).

My objective is to identify the risk factors that influence hospital admission (a simple study to begin with).

But the problem is that my dataset looks like this:

Let’s say we have a subject called “John”. He visited the hospital 19 times during my data period, Jan 2001 to Dec 2005. All 19 of his visits are outpatient visits.
Let’s say we have another subject called “Jack”. He visited the hospital 34 times during the same period, Jan 2001 to Dec 2005. Of his 34 visits, 18 are inpatient admissions and the remaining 16 are outpatient visits.

So now my question is:

Usually an analysis uses only one record per subject/individual, right? If so, on what basis should I pick that one record?

Meaning, for John, which of his 19 visits should I pick? Or is the only way to summarize all of his records into one record?

Similarly for Jack, which of his 18 inpatient visits should I pick? Or is the only way to summarize all of his records into one record?

I would choose only one of Jack’s 18 inpatient visits because we don’t need his outpatient info: we already have a separate group of outpatients, and he is counted in the inpatient class.

Once I have one record per person, I am planning to use a binary classification algorithm.

If you think there is no reason to pick only one record, will I be able to do the analysis with multiple records per person? Can you direct me to a resource where I can learn about this theory/technique? What is that approach called?

Is there any other way to do this? Has anyone done a similar study?

I hope my question is clear. Could you please help me understand how to do this, with a simple explanation?

Chris_Knoll · February 6, 2020, 4:55am

I’m not sure about the ‘binary classification’ part, but it sounds to me like you are defining a prediction problem.

You raise the question: among all the visits a person has, which one should you use?

When faced with a prediction problem, I’ve been taught that you need to define a point in time at which everyone in the population is at a consistent decision-making point for the prediction you want to make. For example, a person has decided to undergo surgery, and you want to predict the outcome of deep-vein thrombosis for people who have undergone that surgery. So, at the time the patient undergoes the surgery, you can apply a predictive model for the risk of the outcome.

In this case, where you want to index on some kind of visit, why not simply create a cohort of ‘earliest visit that has 1 year of prior continuous observation’? In this context, the idea is that you are given a person who has been under observation for a year, and you want to predict a hospital (or inpatient) visit within the next 365 days.
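As a rough illustration of that index rule outside of Atlas, here is a minimal Python sketch. The toy visits, the `observation_start` dictionary and the `index_visits` helper are all made up for illustration, and the sketch assumes each person has a single observation period; it is not how Atlas builds cohorts.

```python
from datetime import date, timedelta

# Hypothetical toy data: start of observation per person, and visit records
# as (person, visit_date, visit_type) tuples.
observation_start = {"John": date(2001, 1, 1), "Jack": date(2001, 1, 1)}
visits = [
    ("John", date(2001, 3, 10), "outpatient"),
    ("John", date(2002, 2, 15), "outpatient"),
    ("John", date(2003, 7, 1), "outpatient"),
    ("Jack", date(2001, 5, 2), "outpatient"),
    ("Jack", date(2002, 1, 20), "outpatient"),
    ("Jack", date(2002, 8, 5), "inpatient"),
]


def index_visits(visits, observation_start, min_prior_days=365):
    """Return {person: date of the earliest visit that has at least
    min_prior_days of observation before it}."""
    index = {}
    # Walk the visits in chronological order so the first hit is the earliest.
    for person, visit_date, _visit_type in sorted(visits, key=lambda v: v[1]):
        if person in index:
            continue  # this person already has an index visit
        prior = visit_date - observation_start[person]
        if prior >= timedelta(days=min_prior_days):
            index[person] = visit_date
    return index


index = index_visits(visits, observation_start)
for person, index_date in sorted(index.items()):
    print(person, index_date.isoformat())
```

Each person ends up with exactly one index date, which is what answers the "which visit do I pick?" question: visits in the first year of observation are skipped, and later visits are not used as a starting point.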

In Atlas, you’d create a prediction study. Your target cohort would be:
Earliest visit after 1 year of continuous observation

The outcome you are attempting to predict is:
All inpatient visits after 1 year of continuous observation
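To make the target/outcome pairing concrete, here is a small Python sketch of how one label per person could be derived. The index dates, the inpatient dates and the `label_outcome` helper are hypothetical, and the time-at-risk window (the day after the index date through 365 days later) is my assumption; in Atlas you would set the time-at-risk in the prediction study itself.

```python
from datetime import date, timedelta

# Hypothetical index dates (one per person, from the target cohort) and
# inpatient visit dates (from the outcome cohort).
index_dates = {"John": date(2002, 2, 15), "Jack": date(2002, 1, 20)}
inpatient_visits = {
    "Jack": [date(2002, 8, 5), date(2004, 3, 1)],
    # John has no inpatient visits.
}


def label_outcome(index_dates, inpatient_visits, risk_days=365):
    """Return {person: 1 if any inpatient visit falls in the window
    (index_date, index_date + risk_days], else 0}."""
    labels = {}
    for person, index_date in index_dates.items():
        window_end = index_date + timedelta(days=risk_days)
        had_outcome = any(
            index_date < visit_date <= window_end
            for visit_date in inpatient_visits.get(person, [])
        )
        labels[person] = int(had_outcome)
    return labels


labels = label_outcome(index_dates, inpatient_visits)
for person, label in sorted(labels.items()):
    print(person, label)
```

This gives one row and one binary label per person, so the repeated visits no longer need to be reduced by hand to a single record.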

When you run the PLP package out of Atlas, it creates a predictive model based on the ‘features’ of the population you select (such as drugs, procedures, conditions, etc.).

I’m not an expert, but there are a few tutorial videos on this subject that you can find on the main OHDSI website. I think that covers the basics, though, and should be enough to get you started.