And now for something completely different. I just reviewed a paper on unsupervised clustering of patients, and realized not much has been done on this topic in OHDSI. Perhaps this is because it doesn’t fall neatly into our buckets of characterization, population-level estimation, or patient-level prediction. Perhaps it is because it is a bit of a soft topic: when is a clustering good?
So the topic is: given a cohort of patients, each patient having a set of features, can we cluster the patients so that patients with similar feature vectors end up in the same cluster? (Not to be confused with clustering of features.)
I think it would be valuable to do methods research on this, and eventually create standardized tools (a HADES package) that support this. These are the use cases I envision:
- To better understand a cohort. For example, given some not-so-well understood disease, a perfectly valid question is “is everyone that has this disease the same?” If we identify clear subgroups, we may conclude there are different disease subtypes, maybe with different etiologies, which would be important for further investigations.
- As outcomes in estimation studies. Arguably, the outcome definitions we currently use are a bit arbitrary. Often outcomes are collections of symptoms that medicine has deemed to constitute a disease, but the causal relationship with the exposure might be with something different, something we could derive from the data.
- As effect modifiers. Possibly the effect of the exposure on the outcome is different in different subgroups of patients. One way to identify groups is through clustering.
There may be other use cases.
Clustering of patients is non-trivial. Here are some of the challenges:
- Our data tend to be highly sparse, meaning many standard clustering techniques will fail.
- Our data tend to be big (many features and many patients), causing many algorithm implementations to fail.
- Evaluating a clustering is hard. We can eyeball the clusters to see if they make sense, but are there more objective measures?
- Visualizing and (automatically) annotating clusters is hard. How do we convey the essence of a grouping that was derived automatically through some complicated algorithm?
- Can we cluster across sites in the OHDSI network? And can we apply a clustering from one site to patients in another?
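
To make the evaluation question a bit more concrete, here is a minimal sketch of one candidate objective measure, the silhouette coefficient, computed on sparse data. Everything in it is an assumption for illustration: patients are represented as sets of concept IDs (so only the features that are present are stored), similarity is measured with Jaccard distance, and the toy cohort and the function names `jaccard_distance` and `silhouette` are made up. It is not a proposal for how a HADES package should do this, and the pairwise loop would not scale to real cohort sizes.

```python
from statistics import mean


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b| for two sets of concept IDs."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty feature sets count as identical
    return 1.0 - len(a & b) / len(union)


def silhouette(patients, labels):
    """Mean silhouette coefficient of a clustering.

    patients: list of sets of concept IDs (hypothetical sparse encoding)
    labels:   list of cluster labels, one per patient
    """
    scores = []
    for i, patient in enumerate(patients):
        # Group distances from patient i to all other patients by cluster.
        by_cluster = {}
        for j, other in enumerate(patients):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(
                    jaccard_distance(patient, other)
                )
        own = by_cluster.pop(labels[i], None)
        if own is None or not by_cluster:
            # Singleton cluster, or only one cluster overall: score set to 0.
            scores.append(0.0)
            continue
        a = mean(own)  # mean distance to the patient's own cluster
        b = min(mean(d) for d in by_cluster.values())  # nearest other cluster
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return mean(scores)


# Toy cohort: two groups of patients with disjoint sets of concept IDs.
patients = [
    {1, 2, 3}, {1, 2, 4}, {1, 2, 3, 4},
    {10, 11, 12}, {10, 11, 13}, {10, 11, 12, 13},
]
matching_labels = [0, 0, 0, 1, 1, 1]  # follows the two groups
mixed_labels = [0, 1, 0, 1, 0, 1]     # cuts across the two groups

print(silhouette(patients, matching_labels))
print(silhouette(patients, mixed_labels))
```

The labeling that follows the two groups scores higher than the one that cuts across them. Whether a high silhouette also means the clusters are clinically meaningful is of course a separate question.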
What do people think? Is this something people would like to explore? Are people already working on this?
Looping in @msuchard, @hripcsa, @Patrick_Ryan, @Rijnbeek, @chenyong1203, @Daniel_Prieto