Thanks @SCYou, your question has kept me thinking all weekend.
Revisiting the use cases
I’d first like to revisit the use cases I mentioned earlier, using diabetes as an example:
- To better understand a cohort. If diabetes was still an unknown, discovering that it had subtypes would have been worth a Nobel prize. Yet no OHDSI tool would currently allow you to discover that.
- As outcomes in estimation studies. Many exposures (e.g. metformin) relate very differently to the different types of diabetes, because the underlying biological mechanisms differ.
- As effect modifiers. Even if the clusters were unrelated to the disease mechanism, they may still matter. Imagine a hypothetical scenario where we find only two clusters, one for women and one for men, and that these groups have little in common, except that they both can suffer from diabetes. It would still be good to know this, as the different groups may metabolize the exposure differently, etc. Computing a single average treatment effect across the two groups would make little sense.
The aim of phenotype clustering
I’ve been talking about ‘patient clustering’ before, but what I really meant was ‘phenotype clustering’ (or ‘cohort entry clustering’, which is more correct but less sexy).
I would argue the aim of phenotype clustering is to detect heterogeneity in the phenotype. Does the phenotype constitute several distinct phenomena that are clearly separated? The way we try to answer that question is basically by seeing whether the distribution of features we observe is better explained as a mixture of distributions.
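To make that concrete, here is a minimal sketch of the mixture idea (in Python, not an OHDSI tool): fit mixtures with different numbers of components to a feature matrix and let an information criterion tell you whether more than one component is warranted. The Gaussian assumption and the toy data are purely illustrative, not what I actually ran.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))  # placeholder for a real (entries x features) matrix

# Fit mixtures with 1..5 components and compare them by BIC
# (lower BIC = better fit after penalizing model complexity).
bics = {}
for k in range(1, 6):
    gmm = GaussianMixture(n_components=k, covariance_type="diag", random_state=0)
    gmm.fit(X)
    bics[k] = gmm.bic(X)

best_k = min(bics, key=bics.get)
print(f"BIC prefers {best_k} component(s): {bics}")
```

If the data really are a single homogeneous blob, the one-component model should win; if the phenotype is a mixture of distinct phenomena, more components should be preferred.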
We could (and should) try to model this as a statistical model, as a mixture of (language) models generating the features. However, a more common approach to clustering is to compute distances between data points, and use those distances to cluster (minimizing intra-cluster distances and maximizing inter-cluster distances). One advantage of using distances over global statistical models is that it makes fewer assumptions about the local topology of the feature space. Another advantage is that distances are easily transformed into 2D space, allowing visualization, and thus running the problem through the human visual cortex, which has very advanced (and fast!) clustering algorithms of its own (unfortunately only in 2D or 3D, so actually rather limited).
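As an illustration of that distance-based route (again just a sketch, with cosine distance and random data standing in for the real features): compute pairwise distances, cluster on them, and project the same distances into 2D so the visual cortex can do its work.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.random((200, 50))  # placeholder (entries x features) matrix

# Pairwise distances between cohort entries (cosine is an arbitrary choice here).
d_condensed = pdist(X, metric="cosine")

# Hierarchical clustering on the distances, cut into (at most) 3 clusters.
clusters = fcluster(linkage(d_condensed, method="average"), t=3, criterion="maxclust")

# Project the distance matrix into 2D for visual inspection of cluster structure.
coords = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords_2d = coords.fit_transform(squareform(d_condensed))
```

Plotting `coords_2d` colored by `clusters` is the 'run it through the visual cortex' step.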
What determines similarity
To finally get back to @SCYou’s question, similarity (or its complement: distance) tells us how likely it is that two data points (patients / cohort entries) arise from the same distribution. A measure like tf x idf that I’m currently using actually fits very nicely into this interpretation.
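For illustration, a simplified sketch of the general idea (not my actual code, and the concept codes are made up): treat each cohort entry's set of concepts as a 'document', weight concepts by tf x idf, and use cosine similarity as the similarity measure.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical cohort entries, each a bag of concepts observed around the index date.
entries = [
    ["diabetes_dx", "metformin", "obesity"],
    ["diabetes_dx", "insulin", "ketoacidosis"],
    ["diabetes_dx", "metformin"],
]

# Turn each entry into a 'document' and compute tf-idf weighted vectors.
docs = [" ".join(concepts) for concepts in entries]
tfidf = TfidfVectorizer(token_pattern=r"\S+").fit_transform(docs)

# Cosine similarity between entries; concepts shared by everyone (here 'diabetes_dx')
# get a low idf weight and contribute little, which is exactly the point.
similarity = cosine_similarity(tfidf)
```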
An important question though is what features to compare on. In my diabetes example I went in with very few assumptions about what features mattered: I included features from all domains (conditions, drugs, procedures, etc.), and within those domains I included all features I could construct (all conditions, all drugs, etc.). Furthermore, the time windows I specified included both time before and after cohort entry, so both events leading up to the diabetes diagnosis (possibly indicating etiology) and events afterwards (possibly indicating manifestation of the disease and sequelae). The reason I cast my net so wide is that I wasn’t sure which features would be important. I went in ‘hypothesis free’ (a.k.a. ‘ignorant’, which, as you know, is my base state)
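To show what casting the net wide looks like in code, here is a toy sketch of that kind of feature construction. The table layout, concept names, and the crude two-window split are all simplifications for illustration, not the actual feature extraction I used.

```python
import pandas as pd

# Hypothetical long-format event table: one row per event per cohort entry.
events = pd.DataFrame({
    "entry_id":        [1, 1, 2, 2],
    "domain":          ["condition", "drug", "condition", "procedure"],
    "concept":         ["obesity", "metformin", "hypertension", "hba1c_test"],
    "days_from_index": [-30, 10, -400, 5],
})

# Crude split into a pre-index and post-index window, mirroring the idea of
# including time both before and after cohort entry.
events["window"] = events["days_from_index"].apply(lambda d: "pre" if d < 0 else "post")

# One binary feature per (domain, concept, window) combination.
events["feature"] = events["domain"] + ":" + events["concept"] + ":" + events["window"]
X = pd.crosstab(events["entry_id"], events["feature"]).clip(upper=1)
print(X)
```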
An alternative approach would have been to focus on specific features. For example, given that diabetes is a metabolic disease, I could have focused on metabolic features before the diabetes diagnosis instead. That might have given me better insight into different etiologies (e.g. I could have found clusters based on obesity), and would probably have suppressed many clusters that have nothing to do with the diabetes phenotype. But I also might have missed the gestational diabetes cluster, which may or may not matter to me.
In the end, phenotype clustering inherently doesn’t have a direct causal interpretation. We use the observed features as proxies for latent, underlying processes. We could have informed opinions on which features matter most if we already have a suspicion of what the underlying processes look like. Or we could go in with all the features we have, at the price of generating more noise, but being less likely to miss something (the classical sensitivity-specificity tradeoff).