I am working on a binary classification problem. My dataset has two classes of subjects: outpatients and inpatients (the class proportion is 66:33).
My objective is to identify the risk factors that influence hospital admission (a simple study to begin with).
The problem is that my dataset looks like this:
Let’s say we have a subject called “John”. He visited the hospital 19 times during my data period (Jan 2001 to Dec 2005). All 19 of his visits are outpatient visits.
Let’s say we have another subject called “Jack”. He visited the hospital 34 times during the same period (Jan 2001 to Dec 2005). Of his 34 visits, 18 are inpatient admissions and the remaining 16 are outpatient visits.
So my questions are:
- Usually an analysis has only one record per subject/individual, right? If so, on what basis should I pick that one record?
For John, which of his 19 visits should I pick? Or is the only way to summarize all of his records into one record?
Similarly, for Jack, which of his 18 inpatient visits should I pick? Or is the only way to summarize all of his records into one record?
For Jack I would choose from his 18 inpatient visits only, because his outpatient information is not needed: there is already a separate group of outpatients, and he is assigned to the inpatient class.
Once I have one record per person, I plan to use a binary classification algorithm.
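To make concrete what I mean by summarizing multiple records into one record per subject, here is a minimal sketch in Python with pandas. The column names (`subject`, `visit_type`) and the summary features (`n_visits`, `n_inpatient`, `n_outpatient`, `ever_inpatient`) are hypothetical and not my real schema; this is just one possible way to collapse the visits, using the John and Jack counts from above.

```python
import pandas as pd

# Hypothetical long-format data: one row per visit (column names are assumptions).
visits = pd.DataFrame({
    "subject": ["John"] * 19 + ["Jack"] * 34,
    "visit_type": ["outpatient"] * 19 + ["inpatient"] * 18 + ["outpatient"] * 16,
})

# Collapse to one row per subject: total visits and number of inpatient visits.
summary = (
    visits.assign(is_inpatient=visits["visit_type"].eq("inpatient"))
    .groupby("subject")
    .agg(n_visits=("visit_type", "size"), n_inpatient=("is_inpatient", "sum"))
)

# Derived columns: outpatient count and a subject-level class label
# (a subject is labelled inpatient if admitted at least once).
summary["n_outpatient"] = summary["n_visits"] - summary["n_inpatient"]
summary["ever_inpatient"] = summary["n_inpatient"] > 0

print(summary)
```

With this kind of summary, each subject contributes exactly one row, and `ever_inpatient` would be the target for the binary classifier.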
- If you think there is no reason to pick only one record, can I do the analysis with multiple records per person? Can you point me to a resource where I can learn about the theory/technique? What is that approach called?
Is there any other way to do this? Has anyone done a similar study?
I hope my question is clear. Could you please help me understand how to do this, with a simple explanation?