
On 'phenotype flavours'

Hi, Just adding my 2 cents:

I think we need to draw the line between ‘features’ and ‘phenotypes’. Features, in my experience, do not have precise dates around them (otherwise, you might sort features for a person by time relative to index, which I don’t think anyone does). Phenotypes are supposed to represent the period of time of an ‘active status of a clinical concept’, such as a disease state or treatment status.

I see a few problems with H/O codes with respect to ‘active state’. May I use the example of ‘History of nosebleeds’? Ok, so you have a visit, and you record a history of nosebleeds. Is this an active status? No; if it were, they would record a nosebleed. Does the H/O code represent a single event? Probably not: if you say ‘I have a history of nosebleeds’, it probably means it’s a recurring problem. How do you get an index date from a statement about a recurring problem? This is where @Gowtham_Rao is talking about ‘not using H/O codes as entry events of a cohort’, because H/O codes state only one thing about timing: the actual event happened some time prior to the date the H/O code was recorded. From an ‘active state of a phenotype’ perspective, it’s erroneous.

However, if you have an entry event (index date) that identifies something ‘active’, like ‘exposed to blood thinners’, and you want to limit those index events to people with a prior nosebleed, you would have an inclusion criterion of either ‘active nosebleeds’ or ‘H/O nosebleeds’, therefore using both active codes and H/O codes in your phenotype. If you want to say something about ‘recent’ nosebleeds, however, you have to think carefully about whether an H/O code can represent ‘recent’ with the appropriate fidelity, and this is where you make the choice to use the code or not.
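The distinction above can be sketched in a few lines. This is a hypothetical illustration, not OMOP/ATLAS logic: the concept names, record shape, and helper function are all made up. The point it demonstrates is that an active code carries an event date usable for inclusion logic, while an H/O code can satisfy a ‘prior event’ rule but could never supply an index date of its own.

```python
from datetime import date

# Hypothetical concept labels -- illustrative only, not real OMOP concept IDs.
ACTIVE_NOSEBLEED = "nosebleed"
HX_NOSEBLEED = "history_of_nosebleed"

def eligible_index_events(index_dates, condition_records, allow_history=True):
    """Keep index events (e.g. blood-thinner starts) with any prior nosebleed.

    An active code dates the event itself; an H/O code only tells us the
    event happened *sometime before* its record date, so it can support a
    'prior nosebleed' inclusion rule but cannot anchor an index date.
    """
    concepts = {ACTIVE_NOSEBLEED}
    if allow_history:
        concepts.add(HX_NOSEBLEED)
    return [
        idx for idx in index_dates
        if any(c in concepts and d <= idx for c, d in condition_records)
    ]

records = [(HX_NOSEBLEED, date(2020, 1, 5))]   # only an H/O code on file
starts = [date(2020, 3, 1)]                     # one blood-thinner start

print(eligible_index_events(starts, records))         # H/O code satisfies inclusion
print(eligible_index_events(starts, records, False))  # excluded if H/O codes disallowed
```

Note that for a ‘recent nosebleed’ rule (say, within 90 days of index) the H/O record date would be a poor proxy, which is exactly the fidelity question raised above.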

Prevalence vs. Incidence:
I have found that there’s a lot of debate about prevalence and incidence, but one thing I’ve realized is that both prevalence and incidence should be determined from the ‘active state’ of a phenotype. The difference between prevalence and incidence is simply how the ‘active state’ overlaps the specified time-at-risk: if the active state starts during the time-at-risk, it’s incidence; if it merely overlaps the time-at-risk, it is prevalence. I’m sure this perspective is controversial, but it’s the definition I use, and from this perspective, dates matter.
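That overlap rule is mechanical enough to state as code. A minimal sketch, assuming closed date intervals for both the episode and the time-at-risk (TAR) window; the function name and interval conventions are my own, not an OHDSI standard:

```python
from datetime import date

def classify(active_start, active_end, tar_start, tar_end):
    """Classify an 'active state' episode against a time-at-risk window.

    If the episode *starts* inside the TAR it counts as incident; if it
    merely *overlaps* the TAR (e.g. it was already ongoing at TAR start)
    it counts as prevalent. Returns None when there is no overlap at all.
    """
    overlaps = active_start <= tar_end and active_end >= tar_start
    if not overlaps:
        return None
    if tar_start <= active_start <= tar_end:
        return "incident"   # an incident episode is also prevalent under this rule
    return "prevalent"

# Episode already active when the 2013 TAR opens: prevalent only
print(classify(date(2012, 6, 1), date(2013, 6, 1),
               date(2013, 1, 1), date(2013, 12, 31)))
# Episode starting during the TAR: incident
print(classify(date(2013, 3, 1), date(2013, 4, 1),
               date(2013, 1, 1), date(2013, 12, 31)))
```

This is also why H/O codes are awkward here: without a real `active_start`, the classification cannot be made.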

Features vs. Phenotypes
So, with the above said, I think H/O codes have utility in the context of features. I also think phenotypes (which are identifying active state) can be leveraged to create features of their own. The principle that ‘phenotypes are not code lists’ should resonate here for everyone, in that you can create features for predictive models from H/O codes, you can create them from phenotypes, but when you push H/O codes into phenotypes, I agree with @Gowtham_Rao that H/O codes for entry events have limitations that should be carefully considered.


Dates do and don’t matter. I think @Daniel_Prieto has raised this above. In 2002, in a database far far away, I had a heart attack and forever entered the fraternal fellowship cohort on that date of “past medical history acute myocardial infarction.” I join a new database in 2010, and for record keeping they want to record my fraternal membership in this cohort of past medical history of acute myocardial infarction. A researcher comes along and wants to define the 2013 prevalence of those who are members of the fraternal fellowship of past medical history acute myocardial infarction. The H/O code has value in addressing this research question.

For the record I’ve never had a heart attack but without longitudinal data linkage or a diagnosis code of H/O AMI your database will never know my true past!


Hey @Chris_Knoll thanks for your response.
I agree with 99% of what you say, with only one exception: there are conditions (so-called ‘chronic’ ones) that, unfortunately, once you have, you have forever. Hence the use of ‘migraine’ as an example. But the same would be true for hypertension, dementia, and likely diabetes. With very few exceptions, an ‘h/o’ any of these means you have not only a ‘history of’ but also an ‘active status’ of these.
I think I failed to communicate that my ‘h/o phenotype’ (which again, was only an example of a ‘phenotype flavour’) was of interest for chronic conditions, not for acute ones like nosebleed or COVID-19 or flu or pneumonia.

Thanks again!

My ‘overarching’ question, however, was not whether I should or not use H/O but rather if anyone has looked into the use of ‘phenotype flavours’ or is interested to look into this. See here:

Given the long thread, varied opinions, and various data representations (see Gowtham’s characterisation for an example), it looks like maybe there is a need for research on this topic…


…'tis the season to stir up some OHDSI family drama. :santa:

In reading this chain, it seems to be the old adage of the two wolves of OHDSI research: 1) making study decisions for a database that bring out the best of that database’s contributions to secondary use of data, and 2) making study decisions in the spirit of network science across many databases (some of what @Gowtham_Rao is alluding to).

As secondary users of data, we don’t control the hand we’re dealt. Primary care databases are going to have some uniqueness that we shouldn’t disregard. This is a whole class of care setting. I trust @Daniel_Prieto @edburn @tduarte and others have many lessons they can elucidate for the community here.

I agree with this… and not just because I need a DPhil thesis in the next 6 years. :wink:

@Kevin_Haynes has a few solid gems in this thread. I appreciate the levity on this dialogue. It’s true. The US healthcare system, in particular, contributes a lot of fragmentation of care. Now, that again, is actually an adaptation of Wolf #1 above.

Some more levity: I recently reviewed my own medical chart and found a flow sheet that said I have 1 child. (Reader: to my knowledge, I have never had a child.) Apparently, I talk about my dog enough that a provider recorded it as “has a child at home.” So yeah… real world data are flawed. We know this.

Whether we like it or not, flavours seem to be a natural hazard of RWD.

Remind me, what did we find in that first OMOP experiment, @Christian_Reich?? I feel there is something in our history that could help this thread but my brain is foggy.


Wow. @krfeeney is dragging me into the row. And I had so much fun just watching.

The thing is this: If we had perfect databases, i.e., with everything recorded from birth to death, and with exact dates, we wouldn’t have a problem. @Gowtham_Rao would be right, and we would create these clean phenotypes, and they would do the job no matter what use case you employ them in. In fact, we wouldn’t need phenotypes, because the CONDITION_OCCURRENCE records could be taken at face value.

But we don’t. In particular, the Condition records have lousy specificity, sensitivity, or timeliness. The databases deal with that differently. For example, chronic diseases are recorded only once, or a repetition of a condition record might mean its re-confirmation (what @Kevin_Haynes keeps telling us). For acute diseases, a record could mean symptomatic onset, or merely the disposition toward it: Is “Asthma” an acute asthma attack, or the atopic disease making the patient susceptible to the attacks? There are mixed ones, like syphilis or tuberculosis.

So, yes, we not only need @Daniel_Prieto’s flavors, we need to explicitly say what each flavor is optimized for, and how:

  • For specificity - using two diabetes concepts within one month
  • For sensitivity - using any implication of diabetes in conditions, procedures, drugs, whatever
  • For cohort start date (index) - using a key diagnostic procedure as index date
  • For cohort end date (so you can calculate in-patient rates) - using some indication of the condition having subsided
  • For use in an exclusion criterion - using h/o
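The first flavor above, optimized for specificity, is concrete enough to sketch. A hypothetical illustration, assuming a 30-day window stands in for “one month” (the threshold and function name are my own, not a workgroup standard):

```python
from datetime import date, timedelta

def meets_specificity_rule(diabetes_dates, window_days=30):
    """Specificity-optimized flavor sketch: require two diabetes condition
    records within `window_days` of each other. A single stray code is not
    enough; the repeated record acts as confirmation."""
    ds = sorted(diabetes_dates)
    # The minimum gap always falls between some adjacent pair once sorted,
    # so checking adjacent pairs is sufficient.
    return any(b - a <= timedelta(days=window_days) for a, b in zip(ds, ds[1:]))

print(meets_specificity_rule([date(2021, 1, 3), date(2021, 1, 20)]))  # two codes 17 days apart
print(meets_specificity_rule([date(2021, 1, 3), date(2021, 6, 1)]))   # codes too far apart
```

The other flavors (sensitivity, index anchoring, cohort exit, exclusion via h/o) would each get their own rule, which is exactly why the optimization function needs to be stated explicitly per flavor.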

Let’s make a good set of flavors, and create those phenotypes. Be very explicit what the optimization function is, and annotate the criteria with comments explaining what each of them is supposed to be achieving.



Bottom line - we need good quality cohort definitions that have gone through best practice cohort definition development and evaluation. We have good infrastructure (the OHDSI Phenotype Library), and it can take an infinite number of cohorts and various phenotype flavors.

But there are standards to get a cohort in - as defined by the OHDSI Phenotype Development and Evaluation workgroup. There are already 127 cohorts here https://data.ohdsi.org/PhenotypeLibrary that have gone in but haven’t completed peer review (all cohorts with the [P] prefix).

It would be great if we all worked together on this and made progress. We need:

  1. People to make contributions to the OHDSI PL
  2. Peer reviewers to review and critique them

Hi all, I have followed this discussion with great interest and have been working on Parkinson’s disease and parkinsonism cohort definitions. We are not yet up on an OMOP-CDM model but anticipate getting that up and running by early next year so our team can contribute more meaningfully. We had a collaborative group of stakeholders in a demonstration project for the California PD Registry where we wanted to be (a) inclusive of all neurodegenerative parkinsonisms but have the (b) flavors (in the sense of this thread) that could handle the specificity of PD and the less specific but still critically important possible types of PD. We ended up with a nice consensus “classification” system we think is amenable (theoretically) to scalable phenotypes. We are working to publish this to see if it has legs in the PD/parkinsonism community.

I’ve attached a brief overview of this we presented to the Calif PD Registry and the CDC and we do anticipate further refinement as we go forward.

Happy to help contribute, especially once we get our OMOP-CDM going so we can test out the various ways each flavor could be identified (and characterize those).

This is fantastic @allanwu

The right group for this conversation would be the OHDSI Phenotype Development and Evaluation workgroup. We meet every two weeks (second and fourth Friday). It sounds like the next step is to build the algorithms to identify persons in your data source (once you have the OMOP-CDM). If you need help or would like to talk to this workgroup - please let me know.