
Unsupervised learning (clustering) in OHDSI

Since this issue is related to the ultimate goal of causal inference, it’s not easy to solve…

Let me imagine four patients.

A: 54yr Female Non-Hispanic White, No insurance, 5yr type 2 DM, Recent acute myocardial infarction
B: 55yr Male Hispanic, Private insurance, type 1 DM only.
C: 64yr Male, Non-Hispanic Black, Medicaid, 20yr type 2 DM, Recent stroke
D: 75yr Female Non-Hispanic Asian, Medicare, 5yr hypertension only.

Which two patients are closer to each other than to the others? How can we determine that?

Thanks so much for thinking of my work as one possible solution. I would love to work on this with you @Adam_Black!


Thanks @SCYOU, your question has kept me thinking all weekend :wink:

Revisiting the use cases

I’d first like to revisit the use cases I mentioned earlier, using diabetes as an example:

  1. To better understand a cohort. If diabetes were still unknown, discovering that it has subtypes would have been worth a Nobel Prize. Yet no OHDSI tool currently allows you to discover that.
  2. As outcomes in estimation studies. The relationship between many exposures (e.g. metformin) and the different types of diabetes is very different, because the underlying biological mechanisms differ.
  3. As effect modifiers. Even if the clusters were unrelated to the disease mechanism, they may still matter. Imagine a hypothetical scenario where we find only two clusters, one for women and one for men, and these groups have little in common except that both can suffer from diabetes. It would still be good to know this, as the different groups may metabolize the exposure differently, etc. Computing a single average treatment effect across the two groups would make little sense.

The aim of phenotype clustering

I’ve been talking about ‘patient clustering’ before, but what I really meant was ‘phenotype clustering’ (or ‘cohort entry clustering’, which is more correct but less sexy).

I would argue the aim of phenotype clustering is to detect heterogeneity in the phenotype. Does the phenotype constitute several distinct, clearly separated phenomena? The way we try to answer that question is basically by seeing whether the distribution of features we observe is better explained as a mixture of distributions.

We could (and should) try to model this statistically, as a mixture of (language) models generating the features. However, a more common approach to clustering is to compute distances between data points, and use those distances to cluster (minimizing intra-cluster distances and maximizing inter-cluster distances). One advantage of using distances over global statistical models is that they make fewer assumptions about the local topology of the feature space. Another advantage is that distances are easily transformed into 2D space, allowing visualization, and thus running the problem through the human visual cortex, which has very advanced (and fast!) clustering algorithms of its own (unfortunately only in 2D or 3D, so actually rather limited).
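To make the two framings concrete, here is a toy sketch (scikit-learn, synthetic 2D data; the real feature space is of course far higher-dimensional and sparse) fitting a mixture model and a distance-based clusterer on the same points:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated synthetic 'subphenotypes' in a 2D feature space
a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2))
b = rng.normal(loc=[4.0, 4.0], scale=0.5, size=(200, 2))
X = np.vstack([a, b])

# Global statistical model: fit a mixture of distributions
gmm_labels = GaussianMixture(n_components=2, random_state=0).fit_predict(X)

# Distance-based view: cluster on pairwise distances instead
agg_labels = AgglomerativeClustering(n_clusters=2, linkage="average").fit_predict(X)

# On clearly separated data the two framings agree (up to label permutation)
print(adjusted_rand_score(gmm_labels, agg_labels))
```

On messier, higher-dimensional data the two framings can of course diverge, which is exactly where the assumptions discussed above start to matter.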

What determines similarity

To finally get back to @SCYou’s question, similarity (or its complement: distance) tells us how likely it is that two data points (patients / cohort entries) arise from the same distribution. A measure like tf x idf that I’m currently using actually fits very nicely into this interpretation.
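As a minimal sketch of that interpretation (the concept names are made up, and scikit-learn’s text tooling stands in for whatever feature pipeline is actually used): each cohort entry becomes a ‘document’ of concepts, tf x idf down-weights ubiquitous concepts, and cosine similarity compares the resulting vectors.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical cohort entries: each 'document' is the bag of concepts
# observed for one patient around cohort entry (concept names invented)
patients = [
    "type_2_dm metformin hypertension lisinopril",
    "type_2_dm metformin retinopathy",
    "type_1_dm insulin ketoacidosis",
]

vectorizer = TfidfVectorizer()            # idf down-weights common concepts
tfidf = vectorizer.fit_transform(patients)
sim = cosine_similarity(tfidf)            # pairwise similarity matrix

# Patients sharing concepts score higher than those sharing none
print(sim.round(2))
```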

An important question, though, is which features to compare on. In my diabetes example I went in with very few assumptions about which features mattered: I included features from all domains (conditions, drugs, procedures, etc.), and within those domains I included all features I could construct (all conditions, all drugs, etc.). Furthermore, the time windows I specified included both time before and after cohort entry, so both events leading up to the diabetes diagnosis (possibly indicating etiology) and events afterwards (possibly indicating manifestation of the disease and sequelae). The reason I cast my net so wide is that I wasn’t sure which features would be important. I went in ‘hypothesis free’ (a.k.a. ‘ignorant’, which as you know is my base state :wink: )

An alternative approach would have been to focus on specific features. For example, given that diabetes is a metabolic disease, I could have focused on metabolic features before the diabetes diagnosis instead. That might have given me better insight into different etiologies (e.g. I could have found clusters based on obesity), and probably would have suppressed many clusters that have nothing to do with the diabetes phenotype. But I also might have missed the gestational diabetes cluster, which may or may not matter to me.

In the end, phenotype clustering inherently doesn’t have a direct causal interpretation. We use the observed features as proxies for latent, underlying processes. We could form informed opinions on which features matter most if we already have a suspicion of what the underlying processes look like. Or we could go in with all the features we have, at the price of generating more noise, but being less likely to miss something (the classic sensitivity-specificity tradeoff).

@schuemie , The following papers would be good examples for this:

Redefining β-blocker response in heart failure patients with sinus rhythm and atrial fibrillation: a machine learning cluster analysis, Lancet, 2021 Aug; Pubmed

Novel subgroups of adult-onset diabetes and their association with outcomes: a data-driven cluster analysis of six variables; Lancet Diabetes & Endocrinology, 2018 May; Pubmed

Even though I have tried applying autoencoders before, the autoencoder itself is very susceptible to frequency, as @schuemie pointed out. Autoencoders usually do not represent rare features (e.g. ketoacidosis) well, even though these are more important than abundant features (e.g. HbA1c tests) in the medical field. TF-IDF would be a good way to tackle this challenge.


Thanks @SCYou , those are excellent examples of how clustering can answer important clinical questions. I recommend everyone read those two papers.

Just to be clear: I believe that whether an auto-encoder helps or hinders the clustering is a question we should answer with empirical experiments, using real-world data and an a priori established gold standard. I could argue both ways (although I would put my money on it not helping :wink: ).

Using “phenotype clustering” instead of “patient clustering” answers questions I had a few posts ago.

This is an interesting topic! Clustering patients into different subgroups is very important to gain insights from the data.

I was wondering what the best way is to choose a clustering algorithm: simply try everything? Since the data is sparse and high-dimensional, many standard clustering techniques will fail. Could non-negative matrix factorization be applied? What do you think?
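For concreteness, a toy sketch of what NMF on a sparse, non-negative patient-by-feature count matrix could look like (scikit-learn, random data, purely illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import NMF

rng = np.random.default_rng(42)
# Sparse non-negative patient-by-feature count matrix (toy data)
X = csr_matrix(rng.poisson(0.3, size=(100, 50)).astype(float))

nmf = NMF(n_components=5, init="nndsvd", max_iter=500)
W = nmf.fit_transform(X)   # patients x latent components ('soft' memberships)
H = nmf.components_        # latent components x features

hard_labels = W.argmax(axis=1)  # one way to derive hard cluster assignments
print(W.shape, H.shape)
```

The rows of H could then be inspected as ‘topics’ of co-occurring features, which is part of NMF’s appeal for sparse data.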

Our company, Productivity Leap (Finland), recently got EHDEN SME certified and, therefore, we are really keen to start working with the community!

@schuemie, how do you think your approach compares to the methods utilised by Soren Brunak’s group in Copenhagen (Nature paper)?

Hi @anna123! I don’t know what the best method is. The typical OHDSI approach (used for example in effect estimation and prediction) is to

  • Run a large-scale methods experiment, using real data where possible, to gain overall insight into the performance of methods. Two example papers are our estimation method benchmark study, and our paper on dealing with right censoring in prediction.
  • When doing a study to answer a clinical question (e.g. clustering a phenotype), choose the most appropriate method for that study based on what we learned from our methods experiment. If multiple methods may be appropriate, let your study diagnostics (e.g. modularity for clustering) in the context of your study guide your final choice.

I hope the discussion in this thread will lead to a large-scale methods experiment as mentioned above. If you’re interested in contributing, you are most welcome!

Hi @nigehughes. Looking at that paper, it seems they take a different approach, focusing on disease trajectories. This seems more similar to our treatment pathways? I think clustering is more exploratory than this.

I’m still playing with the toy diabetes example, just to understand the problem of clustering a bit better. I tried a lot of clustering algorithms, applying them to the cosine similarities of my tf x idf matrix, but had no luck. Then I realized I was fooled by the nice 2D plot UMAP had created for me: because of the high dimensionality, my similarity matrix was actually quite sparse. UMAP wasn’t bothered, because it was constrained to two dimensions, so A and B ended up close together because both were close to C, even though A and B had no overlap. In the end I used UMAP to create a low-dimensional representation (5 dimensions), and then ran DBSCAN (requiring at least 100 entries per cluster) on that. Below are the results. For every cluster I list the 8 most frequent features, as well as the 8 most informative (highest tf x idf) features.
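For anyone wanting to reproduce the general shape of this two-stage pipeline (reduce to a handful of dimensions, then run a density-based clusterer with a minimum cluster size), here is a toy sketch; TruncatedSVD stands in for UMAP so it only needs scikit-learn, and the data is synthetic:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Toy sparse binary patient-by-feature matrix with two planted subgroups
# that use disjoint sets of features
block1 = (rng.random((300, 40)) < 0.6).astype(float)
block2 = (rng.random((300, 40)) < 0.6).astype(float)
X = csr_matrix(np.block([
    [block1, np.zeros((300, 40))],
    [np.zeros((300, 40)), block2],
]))

# Stage 1: a 5-dimensional representation (UMAP in the post; SVD here)
emb = TruncatedSVD(n_components=5, random_state=0).fit_transform(X)

# Stage 2: density-based clustering; min_samples loosely mirrors the
# 'at least 100 entries per cluster' requirement from the post
labels = DBSCAN(eps=2.5, min_samples=100).fit_predict(emb)
print(np.unique(labels))  # -1 would be DBSCAN's noise label
```

The eps and min_samples values here are tuned to this synthetic data; on real features they would need the same kind of fiddling described above.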

Cluster 1

% Most frequent % Most informative
100 Diabetes mellitus 94 Benign neoplasm of intestinal tract
98 Abdominal mass 92 Benign neoplasm of large intestine
98 Mass of digestive structure 95 Benign neoplasm of gastrointestinal tract
98 Mass of trunk 87 Benign neoplasm of colon
97 Type 2 diabetes mellitus 94 Neoplasm of intestinal tract
97 Neoplastic disease 92 Neoplasm of large intestine
96 Benign neoplastic disease 95 Neoplasm of gastrointestinal tract
95 Neoplasm of digestive system 95 Benign tumor of digestive organ

Cluster 2

% Most frequent % Most informative
100 Renal failure syndrome 99 End-stage renal disease
100 Kidney disease 99 Chronic renal failure
100 Diabetes mellitus 92 OTHER ANTIANEMIC PREPARATIONS
100 Renal impairment 92 Other antianemic preparations
100 Chronic disease of genitourinary system 85 Hyperparathyroidism due to renal insufficiency
100 Chronic kidney disease 85 Secondary hyperparathyroidism
99 End-stage renal disease 77 Unlisted dialysis procedure, inpatient or outpatient
99 Chronic renal failure 95 Anemia in chronic kidney disease

Cluster 3

% Most frequent % Most informative
100 Diabetes mellitus 47 Child examination
96 Inflammation of specific body systems 30 Periodic comprehensive preventive medicine reevaluation and …
96 SENSORY ORGANS 33 age group: 10 - 14
96 RESPIRATORY SYSTEM 62 periodic oral evaluation - established patient
95 OPHTHALMOLOGICALS 62 Periodic oral examination
94 ANTIINFLAMMATORY AGENTS AND ANTIINFECTIVES IN COMBINATION 33 topical application of fluoride - excluding varnish
94 Corticosteroids and antiinfectives in combination 33 Topical application of fluoride - excluding varnish
94 OTOLOGICALS 26 age group: 5 - 9

Cluster 4

% Most frequent % Most informative
100 gender = FEMALE 96 Mother delivered
100 Diabetes mellitus 100 Complication of pregnancy, childbirth and/or the puerperium
100 Finding related to pregnancy 97 Complication occurring during pregnancy
100 Complication of pregnancy, childbirth and/or the puerperium 100 Finding related to pregnancy
97 Complication occurring during pregnancy 90 Birth
96 VITAMIN B12 AND FOLIC ACID 89 Livebirth
96 Mother delivered 90 Routine antenatal care
95 RESPIRATORY SYSTEM 88 Single live birth

Cluster 5

% Most frequent % Most informative
100 Diabetes mellitus 66 hydrochlorothiazide
99 CARDIOVASCULAR SYSTEM 66 LOW-CEILING DIURETICS, THIAZIDES
96 Thiazides, combinations with other drugs 66 Thiazides, plain
95 Type 2 diabetes mellitus 66 Alpha and beta blocking agents and thiazides
94 ACE INHIBITORS, COMBINATIONS 69 Low-ceiling diuretics and potassium-sparing agents
92 Diabetes mellitus without complication 77 ACE inhibitors, other combinations
92 ACE inhibitors and diuretics 78 Angiotensin II receptor blockers (ARBs), other combinations
89 LIPID MODIFYING AGENTS, COMBINATIONS 67 Beta blocking agents, non-selective, and thiazides

Cluster 6

% Most frequent % Most informative
100 gender = FEMALE 91 HORMONES AND RELATED AGENTS
100 Diabetes mellitus 91 ENDOCRINE THERAPY
98 GENITO URINARY SYSTEM AND SEX HORMONES 93 HORMONAL CONTRACEPTIVES FOR SYSTEMIC USE
98 SENSORY ORGANS 68 ethinyl estradiol
97 OPHTHALMOLOGICALS 68 Estrogens
96 ANTIINFLAMMATORY AGENTS AND ANTIINFECTIVES IN COMBINATION 92 Progestogens and estrogens, fixed combinations
96 Corticosteroids and antiinfectives in combination 93 Progestogens and estrogens, sequential preparations
95 CORTICOSTEROIDS, COMBINATIONS WITH ANTIBIOTICS 93 PROGESTOGENS AND ESTROGENS IN COMBINATION

Cluster 7

% Most frequent % Most informative
100 Diabetes mellitus 70 glucagon
93 ALIMENTARY TRACT AND METABOLISM 70 PANCREATIC HORMONES
92 Diabetes mellitus without complication 70 GLYCOGENOLYTIC HORMONES
91 Measurement finding outside reference range 70 Glycogenolytic hormones
88 Abnormal glucose level 63 Diabetic ketoacidosis
87 Type 1 diabetes mellitus 63 Metabolic acidosis, increased anion gap (IAG)
87 DRUGS USED IN DIABETES 63 Metabolic acidosis, IAG, accumulation of organic acids
85 Complication due to diabetes mellitus 63 Ketoacidosis

Cluster 8

% Most frequent % Most informative
100 Diabetes mellitus 96 Cocaine
98 NERVOUS SYSTEM 94 Methadone
98 Pain 94 Fentanyl
98 ALIMENTARY TRACT AND METABOLISM 95 Oxycodone
97 Type 2 diabetes mellitus 93 Tramadol
97 Office or other outpatient visit for the evaluation and mana … 87 Tapentadol
97 OTHER ANALGESICS AND ANTIPYRETICS 89 Methylenedioxyamphetamines (MDA, MDEA, MDMA)
97 Salicylic acid and derivatives 89 Buprenorphine

Cluster 9

% Most frequent % Most informative
100 gender = FEMALE 97 Benign neoplasm of uterus
100 Abdominal mass 94 Uterine leiomyoma
100 Pelvic mass 94 Leiomyoma
100 Diabetes mellitus 97 Benign neoplasm of female genital organ
100 Mass of urogenital structure 97 Benign genital neoplasm
100 Mass of female genital structure 98 Neoplasm of uterus
100 Lesion of uterus 99 Mass of uterus
100 Mass of trunk 98 Neoplasm of female genital organ

Cluster 10

% Most frequent % Most informative
100 Diabetes mellitus 82 Dementia
96 Type 2 diabetes mellitus 74 Subsequent nursing facility care, per day, for the evaluatio …
91 Diabetes mellitus without complication 76 Subsequent nursing facility care, per day, for the evaluatio …
88 Hypertensive disorder 59 Dementia associated with another disease
88 Essential hypertension 51 Cerebral degeneration presenting primarily with dementia
88 Type 2 diabetes mellitus without complication 56 Degenerative brain disorder
87 Pain 50 Alzheimer’s disease
85 Pain finding at anatomical site 54 Cerebral degeneration

I found a good example of a benchmark study of clustering algorithms (in this case for text documents). I’d like to design something similar for our health data.

This is interesting! The cluster of gestational diabetes (cluster 4) is still distinctly identified, but now there are two additional clusters of females (clusters 6 and 9). Cluster 6 might be women like in cluster 4 but a few months later, or perhaps not.

If we are trying to identify heterogeneity, would it make sense to consider findings more robust if more than one algorithm supports them? I mean: if two clustering algorithms identify the persons in current cluster 4 as a distinct cluster, we are more certain that they are different from the rest of the population than if only one algorithm had identified the cluster. Does this make sense?

Again, I don’t know the answer to that, but we could evaluate it in a methods experiment. I think what you’re suggesting is some consensus clustering algorithm that combines two (or more) underlying clustering algorithms. The new consensus clustering algorithm can then be evaluated just like any individual clustering algorithm.

Given our experiences in the past with combining effect estimation methods, I wouldn’t be surprised if a combination of clustering algorithms actually performs worse than the best single one, but we won’t know until we run a benchmark study.

Sorry for spamming this forum thread.

Thinking about the clustering I produced so far, I must admit I’m not sure what we’ve learned from it. Yes, we now know people with cancer can also have diabetes, and the same is true for other groups of people, including people with hypertension, renal failure, dementia, etc. But does this tell me anything about the diabetes phenotype I used as a starting point?

For comparison, I clustered random people with random visits instead. As you can see in the results below, we find many of the same clusters, and again pregnancy and childbirth form their own very distinct cluster. We see a new cluster with allergies, but I guess that with slightly different settings we could have found that in the diabetes phenotype as well (people with allergies can also have diabetes).

Perhaps the only noticeable difference is that the diabetes clustering contains a clear type 1 diabetes (with ketoacidosis) cluster.

Cluster 1

% Most frequent % Most informative
99 gender = FEMALE 98 Complication of pregnancy, childbirth and/or the puerperium
99 Finding related to pregnancy 91 Mother delivered
98 Complication of pregnancy, childbirth and/or the puerperium 89 Routine antenatal care
97 SENSORY ORGANS 87 Complication occurring during pregnancy
97 RESPIRATORY SYSTEM 83 Birth
96 OPHTHALMOLOGICALS 99 Finding related to pregnancy
95 VITAMIN B12 AND FOLIC ACID 82 Livebirth
95 GENITO URINARY SYSTEM AND SEX HORMONES 81 Single live birth

Cluster 2

% Most frequent % Most informative
100 gender = FEMALE 79 ethinyl estradiol
99 GENITO URINARY SYSTEM AND SEX HORMONES 79 Estrogens
95 VITAMIN B12 AND FOLIC ACID 90 HORMONES AND RELATED AGENTS
95 SENSORY ORGANS 90 ENDOCRINE THERAPY
94 Natural and semisynthetic estrogens, plain 91 HORMONAL CONTRACEPTIVES FOR SYSTEMIC USE
93 ANTIINFECTIVES FOR SYSTEMIC USE 81 ESTROGENS
92 OPHTHALMOLOGICALS 81 Antiandrogens and estrogens
92 ANTIINFLAMMATORY AGENTS AND ANTIINFECTIVES IN COMBINATION 81 Intravaginal contraceptives

Cluster 3

% Most frequent % Most informative
99 Abdominal mass 94 Benign neoplasm of intestinal tract
99 Mass of trunk 94 Benign neoplasm of large intestine
99 Neoplastic disease 96 Benign neoplasm of gastrointestinal tract
98 Benign neoplastic disease 90 Benign neoplasm of colon
98 Mass of digestive structure 94 Neoplasm of large intestine
98 Neoplasm of trunk 94 Neoplasm of intestinal tract
98 Neoplasm of intra-abdominal organs 97 Mass of intestine
98 Benign neoplasm of abdomen 96 Neoplasm of gastrointestinal tract

Cluster 4

% Most frequent % Most informative
98 Heart disease 67 Chronic heart failure
96 Pain finding at anatomical site 70 Abnormal cardiovascular function
96 Pain 59 Chronic diastolic heart failure
95 Measurement finding outside reference range 62 Diastolic dysfunction
94 Hypertensive disorder 61 Diastolic heart failure
94 Essential hypertension 69 Acute heart disease
90 Pain of truncal structure 54 Hypoxemic respiratory failure
88 Office or other outpatient visit for the evaluation and mana … 52 Acute metabolic disorder

Cluster 5

% Most frequent % Most informative
99 Kidney disease 99 End-stage renal disease
99 Measurement finding outside reference range 99 Chronic renal failure
99 Renal failure syndrome 87 OTHER ANTIANEMIC PREPARATIONS
99 End-stage renal disease 87 Other antianemic preparations
99 Chronic renal failure 77 End-stage renal disease (ESRD) related services monthly, for …
99 Renal impairment 85 Hyperparathyroidism due to renal insufficiency
99 Chronic disease of genitourinary system 85 Secondary hyperparathyroidism
99 BLOOD AND BLOOD FORMING ORGANS 93 Anemia in chronic kidney disease

Cluster 6

% Most frequent % Most informative
99 CARDIOVASCULAR SYSTEM 74 Angiotensin II receptor blockers (ARBs), other combinations
94 ALIMENTARY TRACT AND METABOLISM 73 ACE inhibitors, other combinations
94 Thiazides, combinations with other drugs 77 Guanidine derivatives and diuretics
93 ACE INHIBITORS, COMBINATIONS 77 Alkaloids, excl. rauwolfia, in combination with diuretics
91 NERVOUS SYSTEM 77 MAO inhibitors and diuretics
90 ANTIHYPERTENSIVES AND DIURETICS IN COMBINATION 77 Other antihypertensives and diuretics
89 Hydrazinophthalazine derivatives and diuretics 77 DIURETICS
89 Rauwolfia alkaloids and diuretics in combination 78 Beta blocking agents, non-selective, thiazides and other diu …

Cluster 7

% Most frequent % Most informative
98 Developmental mental disorder 88 Centrally acting sympathomimetics
98 NERVOUS SYSTEM 92 Attention deficit hyperactivity disorder
98 Neurodevelopmental disorder 92 Disorders of attention and motor control
94 SENSORY ORGANS 92 Developmental disorder of motor function
93 PSYCHOANALEPTICS 88 PSYCHOSTIMULANTS, AGENTS USED FOR ADHD AND NOOTROPICS
92 Attention deficit hyperactivity disorder 71 Disruptive behavior disorder
92 Disorders of attention and motor control 65 Imidazoline receptor agonists
92 Developmental disorder of motor function 65 ANTIADRENERGIC AGENTS, CENTRALLY ACTING

Cluster 8

% Most frequent % Most informative
97 Fentanyl 97 Fentanyl
96 SENSORY ORGANS 91 Tramadol
96 NERVOUS SYSTEM 91 Buprenorphine
95 OTHER ANALGESICS AND ANTIPYRETICS 89 Methadone
95 Salicylic acid and derivatives 90 Cocaine
94 OPHTHALMOLOGICALS 83 Methylenedioxyamphetamines (MDA, MDEA, MDMA)
94 ANTIINFLAMMATORY AGENTS AND ANTIINFECTIVES IN COMBINATION 87 Oxycodone
92 ALIMENTARY TRACT AND METABOLISM 88 Benzodiazepines; 1-12

Cluster 9

% Most frequent % Most informative
99 RESPIRATORY SYSTEM 56 Allergic rhinitis due to pollen
98 Inflammation of specific body systems 57 Allergic disorder caused by substance
97 Inflammatory disorder of the respiratory system 69 Chronic disease of immune function
97 Inflammatory disorder of the respiratory tract 62 Seasonal allergic rhinitis
97 Inflammation of specific body organs 39 epinephrine
97 SENSORY ORGANS 39 Alpha- and beta-adrenoreceptor agonists
96 Inflammatory disorder of head 39 Local hemostatics
96 OPHTHALMOLOGICALS 27 Perennial allergic rhinitis

@schuemie Terrific and very thought provoking work. I love this ‘negative control’/random visit clustering experiment, and think it’s quite remarkable that you end up with some similar groups to the diabetes example. This is a strong demonstration and caution (at least for me) about the risks of over-interpreting clusters as having more context-specific meaning than may be warranted.

While it wasn’t the desired targeted use case, I wonder if this may point to a different opportunity. Within many of our studies and tools, including LEGEND, we make decisions on ‘subgroups of interest’, and usually they are based on simplistic heuristics over demographics (e.g. young and old, men and women, Black and White) and sometimes on some biological understanding (e.g. renally compromised and hepatically compromised patients may respond differently based on drug metabolism). I wonder if, with a bit more work to verify this hypothesis, we may be identifying subpopulations that are simply ‘consistently empirically different’, and that may be grounds to consider them as targets for subgroup analyses in studies, augmenting our ‘expert-curated’ lists. The initial clusters you’ve shown suggest subgroups that are easy to rationalize post hoc: persons with cardiovascular disease, persons with cancer, persons with kidney issues, persons with respiratory disease, persons with chronic pain, pregnant women…

All that said, I still like the original motivating use cases, and do think understanding patient heterogeneity within a given phenotype could be an important and necessary component of moving toward precision medicine and connecting together population-level effect estimation and patient-level prediction. But I’m at a loss as to whether unsupervised clustering is part of the solution to get there.

I agree the negative control is great.

I wonder if we need to bring in some supervision. Allow at least a little steering based on the things we care about (outcomes). Aside from pure supervised learning, perhaps there is a middle ground. E.g., some kind of penalty with a learned parameter.

Given our experiences in the past with combining effect estimation methods, I wouldn’t be surprised if a combination of clustering algorithms actually performs worse than the best single one, but we won’t know until we run a benchmark study.

Several moons ago I searched the literature across different domains for an example of agreement between any set of clustering algorithms. I didn’t find any such example except in the most trivial of cases. And in my own field of biological networks, I found a complete lack of agreement between algorithms by any measure of mutual information I used. Essentially, the results are highly sensitive to the clustering algorithm you use (and this is independent of the underlying distance metric).
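For those wanting to quantify such (dis)agreement themselves, adjusted mutual information is one of the measures I mean; a toy sketch with two algorithms that ‘see’ the same data very differently:

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_mutual_info_score

# Two interleaved half-moons: a shape where algorithm choice matters a lot
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# AMI is 1 for identical partitions and near 0 for unrelated ones
print(adjusted_mutual_info_score(km, db))
```

Here DBSCAN follows the curved shapes while k-means cuts straight through them, so the AMI lands well below 1 despite both being reasonable algorithms.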

I would say the clustering here is very similar to topic modelling (a bag-of-words vector approach is somewhat problematic for the reasons you have discussed): intuitively, a small number of key words will likely distinguish topics or categories (and their relationships) much better than something like the cosine of their word vectors. There has been work showing that topic modelling (finding latent related topics in a text corpus) and graph clustering (a.k.a. community detection) are actually the same problem.

However, the optimisation problem has been shown to be very “glassy”. This is one of my favourite papers on the subject: essentially, there are a huge number of locally optimal solutions to the “modularity maximisation” objective that are very different from one another. With a simple greedy algorithm it is really easy to land on any one of these solutions, but the overall clusterings produced will differ massively depending on your starting point.
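To make the measure itself concrete, here is a minimal NumPy implementation of Newman modularity; on a toy graph of two triangles joined by a bridge, the natural split scores far higher than an arbitrary one (the glassy-landscape problems only appear at realistic scales, not in a 6-node toy):

```python
import numpy as np

def modularity(A, labels):
    """Newman modularity Q of a partition of an undirected graph."""
    m = A.sum() / 2.0                      # number of edges
    k = A.sum(axis=1)                      # node degrees
    expected = np.outer(k, k) / (2.0 * m)  # null-model edge weights
    same = labels[:, None] == labels[None, :]
    return ((A - expected) * same).sum() / (2.0 * m)

# Two triangles (nodes 0-2 and 3-5) joined by a single bridge edge
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

good = np.array([0, 0, 0, 1, 1, 1])  # split at the bridge
bad = np.array([0, 0, 1, 1, 0, 1])   # an arbitrary split
print(modularity(A, good), modularity(A, bad))
```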

So there is a real trap: you pick a “good” clustering because of bias, rather than uncovering any true latent variables.

I much prefer the idea of modelling the learning problem as “I have a specific query of interest; I want to use my distance space/network topology to find things related (or not related) to my query” rather than “I have a group of objects; enumerate the categories”. Both are fishing expeditions, but the former has a clear frame of reference that is easier to model. (The reason it’s not a simple matter of training a classifier is that you don’t have clear negative labels: I know that someone is certainly in the type 2 diabetes set, but I don’t have a clear definition of when they are not.)

I wonder if we need to bring in some supervision. Allow at least a little steering based on the things we care about (outcomes). Aside from pure supervised learning, perhaps there is a middle ground. E.g., some kind of penalty with a learned parameter.

The simplest approach for this is actually a random walk with restart (RWR, i.e. the Google PageRank algorithm). It could start from an ultra-specific set of patients, e.g. a chart review (or even a single patient). An RWR then ranks the entire data set. The graph structure could be as simple as edges weighted by the co-occurrence of concept terms between patients in the population.
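A minimal RWR sketch in NumPy (the similarity weights below are invented; in practice they would come from concept co-occurrence as described above):

```python
import numpy as np

def rwr(W, seed_idx, restart=0.15, iters=100):
    """Random walk with restart: score all nodes relative to a seed set."""
    P = W / W.sum(axis=0, keepdims=True)  # column-stochastic transition matrix
    r = np.zeros(len(W))
    r[seed_idx] = 1.0 / len(seed_idx)     # restart distribution over the seeds
    p = r.copy()
    for _ in range(iters):
        p = (1.0 - restart) * (P @ p) + restart * r
    return p

# Toy patient-similarity graph: edge weights from concept co-occurrence (made up)
W = np.array([
    [0, 3, 2, 0, 0],
    [3, 0, 2, 0, 0],
    [2, 2, 0, 1, 0],
    [0, 0, 1, 0, 4],
    [0, 0, 0, 4, 0],
], dtype=float)

scores = rwr(W, seed_idx=[0])  # rank everyone relative to patient 0
print(scores.round(3))
```

Patients strongly connected to the seed (directly or via short paths) end up with high scores, which gives the “rank relative to my query” framing rather than an enumeration of categories.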

We need to represent a person’s health state at the index date as a continuous vector, and there are several ways to do that. In addition to evaluating which clustering algorithm works best, we can also evaluate different embedding methods.

Exploring a different approach in this discussion thread
