
Phenotype basics for beginners

(Akshay Kumar) #1

Hello Everyone,

I am new to healthcare and trying to understand phenotypes and how they are useful. From the OHDSI Book and posts in the forum, I understand that a phenotype is a set of criteria (clinical characteristics) used to select patients of interest from a database. Phenotypes are usually implemented in a query language such as SQL.


  1. Let’s say I would like to select patients who had obesity. I just query the condition table to extract patients who had condition = obesity. In my experience, the phenotype/cohort criteria almost always differ for each project, so what is the use of having a public version of a phenotype algorithm “to identify patients with obesity” in repositories like eMERGE? Why are they made reusable and reproducible across multiple sites? I fail to understand why phenotyping is an active area of research with ongoing advances, when I think it’s all about writing SQL.

  2. I also read online that phenotypes can be used for clinical decision support. I guess that means identifying cohorts for our research through a set of criteria, which can help us figure out the differences between cases and controls. Am I understanding this correctly?

  3. I read that phenotypes are a better approach for selecting patients from an EHR system (EHR-based phenotyping).
    A patient’s disease progression can be seen through their EHR, but the records might be biased and incomplete. How are phenotypes useful here?

  4. I see that a few phenotypes have multiple levels. For example, the 1st level looks for obesity, the 2nd level looks for medications related to obesity, the 3rd level looks for lab tests for obesity, etc.

Can someone help me with this?

(Christian Reich) #2


Nice post. You hit the nail on the head defining the problem space:

The reason that is a problem is the resulting lack of efficiency, transparency and effectiveness. In other words, there is no good reason to have a gazillion different ways to define one and the same condition. Yet we create a new one each time. Why? Because the existing ones are not reusable (there is no effective framework for such reuse) and because we don’t know whether the performance characteristics of the ones folks built before us are sufficient for our use case. It then becomes easier to create a new one than to figure out the ones that exist.

This is the crux of why we do not have one good definition for each phenotype. The reason is how medicine is practiced: instead of regularly characterizing the patient holistically, medicine is a business of fixing issues that people are unpleasantly aware of (they suffer). As a result, those issues (signs and symptoms), the underlying cause (diagnostics and diagnoses) and the remedies (procedures and drug treatments) get recorded. But rarely all of them. It is not that a “few” phenotypes have multiple levels; all of them do, really.

Take obesity. As you said, it can be recorded in different ways, depending on how the patient presents. If the patient is suffering from consequences of obesity (any of the problems with the endocrine, cardiovascular or musculoskeletal system), it will get diagnosed as such. If a chubby person walks into the doctor’s office to be treated for, say, hay fever, it will not. Instead, as part of the general admission process, the weight will be recorded. And obesity is a condition that is ridiculously easy to spot. For all the others, there are myriad ways in which they become apparent, based on what the patient presents with at the first encounter and how the workup goes. Now imagine you want to exclude the presence of a condition: instead of looking for one or a few good hallmarks, you now need to play a game of whack-a-mole and make sure none of the potential indicators are positive.
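To make the obesity example concrete, here is a minimal Python sketch of a narrow definition (diagnosis only) next to a broader one (diagnosis or recorded measurements). Everything in it is an assumption for illustration: the `Patient` record, its field names and the three sample patients are hypothetical and far simpler than real OMOP CDM tables, and the BMI cutoff of 30 kg/m² is the commonly used threshold for obesity in adults, not something prescribed by any particular phenotype library.

```python
from dataclasses import dataclass
from typing import Optional, Set


# Hypothetical, simplified patient record (not the OMOP CDM schema).
@dataclass
class Patient:
    person_id: int
    diagnoses: Set[str]          # recorded condition names
    weight_kg: Optional[float]   # latest recorded weight, if any
    height_m: Optional[float]    # latest recorded height, if any


def bmi(p: Patient) -> Optional[float]:
    """Return BMI in kg/m^2, or None when weight or height is missing."""
    if p.weight_kg is None or p.height_m is None or p.height_m <= 0:
        return None
    return p.weight_kg / (p.height_m ** 2)


def obese_by_diagnosis(p: Patient) -> bool:
    """Narrow definition: an explicit obesity diagnosis was recorded."""
    return "obesity" in p.diagnoses


def obese_by_measurement(p: Patient, threshold: float = 30.0) -> bool:
    """Measurement-based definition: BMI at or above the threshold."""
    value = bmi(p)
    return value is not None and value >= threshold


def obese_any(p: Patient) -> bool:
    """Broad definition: either signal is enough."""
    return obese_by_diagnosis(p) or obese_by_measurement(p)


# Made-up patients mirroring the two presentations described above.
patients = [
    Patient(1, {"obesity", "type 2 diabetes"}, None, None),  # diagnosed, no weight
    Patient(2, {"hay fever"}, 110.0, 1.75),                  # weight only
    Patient(3, {"hay fever"}, 70.0, 1.75),                   # not obese
]

narrow = [p.person_id for p in patients if obese_by_diagnosis(p)]
broad = [p.person_id for p in patients if obese_any(p)]
print("diagnosis only:", narrow)
print("diagnosis or BMI:", broad)
```

In this toy data the diagnosis-only definition misses patient 2, who only has a recorded weight and height. Which of the two definitions is "right" depends on how much misclassification the study can tolerate.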

This is why we end up with more than one definition. You need tacit knowledge of how people perceive themselves, how the healthcare system works, how reliably different healthcare events occur, and how reliably these are captured in records. And each use case has a different tolerance for misclassification (what’s good enough).

So, yes, it’s a big problem and solving it is a scientific achievement.

(Akshay Kumar) #3

@Christian_Reich - Thanks very much for helping me with this. Your detailed explanation is very useful for a beginner like me.

In addition, if I would like to validate phenotype algorithms (from the eMERGE repository), is it just a matter of creating cohort definitions in ATLAS as per their criteria (e.g. obesity phenotype definition, T2DM phenotype definition, etc.) and seeing how many true positives, true negatives, etc. I get?
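To make the question concrete, the comparison I have in mind could be sketched in Python as below. This is only an illustration under assumptions: it presumes a trusted reference list of true cases exists (for example from chart review), and the function names and person IDs are made up rather than taken from ATLAS or eMERGE.

```python
def confusion_counts(cohort_ids, truth_ids, all_ids):
    """Count TP, FP, FN, TN for a cohort against a reference standard."""
    cohort, truth, everyone = set(cohort_ids), set(truth_ids), set(all_ids)
    tp = len(cohort & truth)              # selected and truly a case
    fp = len(cohort - truth)              # selected but not a case
    fn = len(truth - cohort)              # a case the definition missed
    tn = len(everyone - cohort - truth)   # correctly left out
    return tp, fp, fn, tn


def metrics(tp, fp, fn, tn):
    """Standard performance characteristics; None when undefined."""
    def ratio(num, den):
        return num / den if den else None

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Made-up example: 10 people, 4 selected by the cohort definition,
# 5 true cases according to the reference standard.
all_ids = range(1, 11)
cohort_ids = [1, 2, 3, 4]
truth_ids = [1, 2, 3, 5, 6]

counts = confusion_counts(cohort_ids, truth_ids, all_ids)
result = metrics(*counts)
print(counts)
print(result)
```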

I also see that OHDSI has PheValuator. Does it do the same thing?

(Christian Reich) #4


PheValuator will do exactly that for you. It will build a probabilistic model from data not involved in the phenotype definition, use that as the truth and then compare. The problem is that the PheValuator model has its own performance characteristics, which may not be that good or that close to the truth. In other words, it tries to use all the other data to second-guess how well your phenotype definition does.
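The idea of comparing against a probabilistic "truth" can be sketched in a few lines of Python. This is not PheValuator's API or code, just an illustration of the principle under assumptions: each person carries a model-predicted probability of being a case, and that probability is split between the "case" and "non-case" cells instead of a hard yes/no label. The probabilities and the function name are made up.

```python
def probabilistic_metrics(rows):
    """rows: iterable of (in_cohort, p_case) pairs, where p_case is a
    model-predicted probability that the person truly has the condition.
    Each person contributes fractionally to the confusion matrix."""
    tp = fp = fn = tn = 0.0
    for in_cohort, p_case in rows:
        if in_cohort:
            tp += p_case          # expected true positives
            fp += 1.0 - p_case    # expected false positives
        else:
            fn += p_case          # expected false negatives
            tn += 1.0 - p_case    # expected true negatives

    def ratio(num, den):
        return num / den if den else None

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
    }


# Made-up probabilities for four people; two are in the cohort.
rows = [(True, 0.9), (True, 0.6), (False, 0.2), (False, 0.1)]
estimates = probabilistic_metrics(rows)
print(estimates)
```

If the model's probabilities are poorly calibrated, every number it produces shifts with them, which is the caveat above: the reference has its own error.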