Given our past experiences with combining effect-estimation methods, I wouldn’t be surprised if a combination of clustering algorithms actually performs worse than the best single one, but we won’t know until we run a benchmark study.
Several moons ago I searched the literature across different domains for an example of agreement between any set of clustering algorithms. I didn’t find one except in the most trivial of cases. And in my own field of biological networks, I found a complete lack of agreement between algorithms by every measure of mutual information I tried. Essentially, the results are highly sensitive to the clustering algorithm you use (and this is independent of the underlying distance metric).
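To make that concrete, here is a minimal sketch of the kind of agreement check I mean, on synthetic data with an arbitrary pick of algorithms (none of this refers to my actual networks): cluster the same points three ways and score every pair of labelings with adjusted mutual information.

```python
from itertools import combinations

from sklearn.cluster import AgglomerativeClustering, KMeans, SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_mutual_info_score

# Synthetic, deliberately noisy blobs stand in for real data.
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=2.5, random_state=0)

labelings = {
    "kmeans": KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X),
    "agglomerative": AgglomerativeClustering(n_clusters=4).fit_predict(X),
    "spectral": SpectralClustering(n_clusters=4, random_state=0).fit_predict(X),
}

# AMI is 1.0 for identical partitions and ~0 for chance-level agreement.
for a, b in combinations(labelings, 2):
    print(f"{a} vs {b}: AMI = {adjusted_mutual_info_score(labelings[a], labelings[b]):.3f}")
```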
I would say the clustering here is very similar to topic modelling (a bag-of-words vector approach is somewhat problematic for the reasons you have discussed). Intuitively, a small number of key words will likely distinguish topics or categories (and their relationships) much better than something like the cosine similarity of their word vectors. There has been work showing that topic modelling (finding latent related topics in a text corpus) and graph clustering (aka community detection) are actually the same problem.
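As a rough illustration of that equivalence (the toy corpus and networkx’s Louvain routine are my own stand-ins, not what that work used): build a bipartite document–word graph and run community detection on it; each community then plays the role of a topic, bundling words with the documents that use them.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities

# Tiny made-up corpus; each document is just a bag of words.
docs = [
    "insulin glucose hba1c metformin",
    "metformin glucose diabetes insulin",
    "stent angina statin infarction",
    "statin infarction angina aspirin",
]

# Bipartite graph: document nodes on one side, word nodes on the other.
G = nx.Graph()
for d, doc in enumerate(docs):
    for word in doc.split():
        G.add_edge(f"doc:{d}", f"word:{word}")

# Communities over the bipartite graph group words together with the documents
# that use them, i.e. something that looks a lot like topics.
for topic in louvain_communities(G, seed=0):
    print(sorted(topic))
```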
However, the optimisation problem has been shown to be very “glassy”. This is one of my favourite papers on the subject: essentially there are a huge number of locally optimal solutions to the “modularity maximisation” objective that are very different from one another. With a simple greedy algorithm it is really easy to land on any one of these solutions, but the overall clusterings produced will differ massively depending on your starting point.
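A quick way to see this for yourself (illustrative planted-partition graph and seeds, not the paper’s setup): rerun a Louvain-style modularity optimiser with different random seeds and check how much the resulting partitions actually agree.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity
from sklearn.metrics import adjusted_mutual_info_score

# Planted-partition graph with fairly weak structure, so the landscape is rough.
G = nx.random_partition_graph([50, 50, 50, 50], p_in=0.12, p_out=0.05, seed=1)

def as_labels(partition, n_nodes):
    """Flatten a list of node sets into a label vector so AMI can compare runs."""
    labels = [0] * n_nodes
    for k, block in enumerate(partition):
        for node in block:
            labels[node] = k
    return labels

runs = [louvain_communities(G, seed=s) for s in range(5)]
for s, part in enumerate(runs):
    print(f"seed {s}: {len(part)} communities, modularity = {modularity(G, part):.3f}")

# Agreement between the first two runs (1.0 would mean identical partitions).
print("AMI(seed 0, seed 1) =",
      adjusted_mutual_info_score(as_labels(runs[0], len(G)), as_labels(runs[1], len(G))))
```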
So there is a real trap: you pick a “good” clustering because of your own bias, rather than because it uncovers any true latent variables.
I much prefer the idea of framing the learning problem as “I have a specific query of interest, and I want to use my distance space/network topology to find things related (or not related) to it” rather than “I have a group of objects, enumerate the categories”. Both are fishing expeditions, but the former has a clear frame of reference that is easier to model (and the reason it’s not a simple matter of training a classifier is that you don’t have clear negative labels - i.e. I know that someone is certainly in the type 2 diabetes set, but I don’t have a clear definition of when they are not).
I wonder if we need to bring in some supervision - allow at least a little steering based on the things we care about (outcomes). Aside from pure supervised learning, perhaps there is a middle ground, e.g. some kind of penalty with a learned weighting parameter.
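One crude, hypothetical version of that middle ground (my own sketch, not a method anyone here has proposed): append the outcome, scaled by a steering weight, to the feature matrix before clustering, and sweep that weight - zero is fully unsupervised, large values let the outcome dominate.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: X are patient features, y is the outcome we care about.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X = StandardScaler().fit_transform(X)

for lam in [0.0, 0.5, 1.0, 2.0]:
    # lam is the steering weight: 0 ignores the outcome entirely.
    X_aug = np.hstack([X, lam * y.reshape(-1, 1)])
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_aug)
    # Score the clustering on the original features, so the outcome only steers
    # the grouping rather than defining it outright.
    print(f"lambda={lam}: silhouette on original features = {silhouette_score(X, labels):.3f}")
```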
The simplest approach for this is actually a random walk with restart (RWR - essentially personalised PageRank, the algorithm behind Google’s original ranking). It could start from an ultra-specific set of patients - e.g. a chart review (or even a single patient) - and the RWR then ranks the entire dataset against that seed set. The graph structure could be as simple as weighted edges based on the co-occurrence of concept terms between patients in the population.
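A minimal sketch of that setup, with made-up patients, concept terms and restart parameter (networkx’s personalised PageRank doing the RWR):

```python
from itertools import combinations

import networkx as nx

# Toy concept profiles; in practice these would come from the patient records.
concepts = {
    "pt1": {"metformin", "hba1c", "obesity"},
    "pt2": {"metformin", "insulin", "hba1c"},
    "pt3": {"statin", "angina", "obesity"},
    "pt4": {"statin", "angina", "aspirin"},
    "pt5": {"insulin", "hba1c", "aspirin"},
}

# Edge weight = number of concept terms two patients share.
G = nx.Graph()
for a, b in combinations(concepts, 2):
    shared = len(concepts[a] & concepts[b])
    if shared:
        G.add_edge(a, b, weight=shared)

# Restart distribution concentrated on the chart-reviewed "query" set
# (here a single patient, as in the single-patient case above).
seeds = {"pt1"}
personalization = {n: (1.0 if n in seeds else 0.0) for n in G}

# alpha is the continuation probability; 1 - alpha is the restart probability.
scores = nx.pagerank(G, alpha=0.85, personalization=personalization, weight="weight")
for patient, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(patient, round(score, 3))
```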