
Race and Ethnicity in the OMOP CDM


Recently, the debate flared up in the CDM Working Group about Ethnicities and Races, and whether the current model is adequate or needs changing. And it keeps coming back again and again.

I am happy to admit that personally I have a strong aversion to all the tribalism and identity assignments of people, and to the fact that we are spending so much time on something that has not yielded a single use case so far for OHDSI. Yes, it is easy, everybody believes he or she “understands” races, and therefore we can endlessly debate it. Gender/Sex debates are equally popular and fruitless from the perspective of us generating evidence that promotes better health decisions and better care, as our mission requires.

But let me set my negative affections aside. Let’s discuss what we really might need, and how it should be represented.

The current ETHNICITY_CONCEPT_ID/RACE_CONCEPT_ID clearly do not work well: They are following some US Office of Management and Budget guidelines for how the US Census should deal with it. The whole Ethnicity “Hispanic” or “Not Hispanic” we have now is because in one of the censuses the third largest race was “Other”, and it was picked mostly by Hispanics. So, a separate entity was created to deal with that. For us I would argue it does not make sense, and more importantly, cannot work anywhere else in the world. In Korea, the split of ethnicities into “Hispanic” and “Not Hispanic” is ridiculous.

The question is: could there be a clean solution? Currently, race and ethnicity definitions are based on the assumption that people can be meaningfully clustered based on similarities in genetic, phenotypic, geographical co-location, social and/or religious attributes. Which one? Whatever works, it seems:

  • Genetic: Obviously, our genes get inherited according to clear rules, and we have haplotypes (genes on the same chromosomes that cannot be easily inherited independently from each other). But as we know from the Genome project our alleles have a cosmopolitan distribution, like that of marine fish and birds. The distribution is not equal, but you have all alleles in all currently defined races. And the difference between any two people within a race is on average larger than the difference between the averages of the races. The best illustration of that is that the White Americans Craig Venter and James Watson are more different from each other than either of them is from the Korean geneticist Kim Seong-Jin, all of whom are among the first people to have had their genomes fully sequenced.
  • Phenotypic: I am not even starting. None of these attributes we think we so easily recognize have a common foundation that allows unambiguous race distinction. Just one example: There are dark-skinned people in South Asia and various native races. And the thing breaks down when people mix. Obama is black? Why? He is just as white.
  • Geographical co-location: Yes, in the pre-industrial world people couldn’t easily move around, and current ethnicities are defined through ancestry from a geographical region. Many people have no clue about the geographic location “they are from”, as the ancestry service commercials nicely show. We discussed the concept of the “French Canadian”, which is a geographical location that arguably was established rather recently based on mass-migration. From where? All over the place.
  • Cultural and religious: don’t get me started. That stuff is completely elusive and only works locally. Apart from their language, the “Hispanics” in the Americas have a very different cultural distinction from the others compared to those in Europe. And obviously, these attributes can be easily changed. I can change my religion. Hey, I actually did a couple times (on paper, happy to tell what happened over beer). I am sure my outcomes are still the same.

People argue that we might not be able to properly define it, but the self-assigned race and ethnicity have strong correlations to disease progression and treatment outcomes, as studies show all the time. So, something must be going on. I agree. But it is a mixture of all the above, which drives social and economic disparities, and often these features are not a function of the individual but of the society around them. Except we have no way of capturing this globally.

Another argument I heard was that maybe we should accept the sub-optimal definition of the thing, since other Domains also have inaccurate concepts (e.g. syndromes in Condition). Far from it. In Condition, we may have some blurring on the edges, but most of the Conditions are pretty clean, and they work all over the world. While race and ethnicity are never clean. Not one of them.

Bottom line: There are personal or societal features of individuals that allow for some clustering. But these clusters have no definable cut-off lines, they are only reproducible locally (if that) and over historically short periods of time. In other words: We are all just Kenyans, who did some hanky panky with the neanderthals.

Here are pragmatic proposals of how to deal with this, removing the obvious flaws of the current system, but allowing the arsenal of OHDSI methods (including preference scoring) to continue working.

  1. We get rid of ETHNICITY_CONCEPT_ID, and continue only with the RACE_CONCEPT_ID, using the mixed Race-Ethnicity Concept hierarchy we currently have (but include the hispanics).
  2. We keep both ETHNICITY_CONCEPT_ID and RACE_CONCEPT_ID, but we split off the lower half of the Race/Ethnicity Concept hierarchy to form a new Ethnicity Domain (including the hispanics). In that world, any ethnicity can be combined with any race.
  3. We drop the Race and Ethnicity Domain Concepts until such time that we can properly model biological and socio-economical attributes of people on a global level. Till then, folks can use their own >2B local concepts.
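A minimal sketch of what each of the three options would mean for a single PERSON record. The `Person` class is a stripped-down stand-in for the real table, the helper names `option_1` to `option_3` are made up for illustration, the two standard concept IDs are the ones I believe Athena lists for White and Hispanic or Latino (please verify), and the local IDs are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Standard concept IDs as I understand them; check Athena before relying on them.
WHITE = 8527
HISPANIC = 38003563
# Hypothetical site-specific concepts (local IDs sit above 2 billion).
LOCAL_MIN = 2_000_000_000
LOCAL_WHITE = 2_000_000_001
LOCAL_FRENCH_CANADIAN = 2_000_000_002


@dataclass
class Person:
    person_id: int
    race_concept_id: int = 0                  # 0 means "no matching concept"
    ethnicity_concept_id: Optional[int] = 0   # None means "column removed"


def option_1(person_id: int, concept_id: int) -> Person:
    """One column only: race_concept_id takes any concept from the mixed
    Race-Ethnicity hierarchy, Hispanic included."""
    return Person(person_id, race_concept_id=concept_id, ethnicity_concept_id=None)


def option_2(person_id: int, race_id: int, ethnicity_id: int) -> Person:
    """Two independent columns: any ethnicity can be combined with any race."""
    return Person(person_id, race_concept_id=race_id, ethnicity_concept_id=ethnicity_id)


def option_3(person_id: int, race_id: int = 0, ethnicity_id: int = 0) -> Person:
    """No standard concepts: only 0 or local (>2B) concept IDs are accepted."""
    for concept_id in (race_id, ethnicity_id):
        if concept_id != 0 and concept_id <= LOCAL_MIN:
            raise ValueError(f"{concept_id} is not a local concept ID")
    return Person(person_id, race_concept_id=race_id, ethnicity_concept_id=ethnicity_id)
```

Under option 1 a person can carry only one value, so "White" and "Hispanic" compete for the same slot; option 2 keeps them independent; option 3 leaves every site to define its own codes, which is exactly what makes network studies on these fields hard.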

Sorry it got so long. Any thoughts?


@Christian_Reich, what do you mean no use cases? Our community published a paper using race and ethnicity data from the OHDSI network (https://academic.oup.com/jamia/article-abstract/26/8-9/730/5542028)…and concluded that “quality of this information in observational databases is concerning” :slight_smile:

I agree it’s an important discussion to try to align on ETL conventions and user guide for the OMOP Common Data Model. While I also have an aversion to an artificial social construct, I have an even larger aversion to changing our data standard from one suboptimal solution to another suboptimal solution, without a precipitating use case.

I would recommend we properly document the assumptions and limitations and current utility of the current structure and vocabulary. And when the time comes that someone in the community has a research question that is aligned to the OHDSI mission for which the current standard is insufficient, then we can discuss alternative solutions and evaluate when the OHDSI community would be prepared to change their data infrastructure to accommodate the change.

Hi @Christian_Reich and @Patrick_Ryan, another solution is to look for a well-defined ontology that represents race and ethnicity in a way that reaches common agreement in the OHDSI community. However, I think in the ontology world, people have different opinions about what is race and what is ethnicity. The point is we need to come up with one agreeable definition and stick with it.
I agree with @Patrick_Ryan’s strategy that:

Documentation and guidance is very important for people to reference.


Christian thanks for your persistence and patience in getting this settled. Following your lead, I’ll confess my biases. I specialized in cross-cultural psychology and health psychology as part of my doctoral training in clinical psychology. I believe that the cultural meaning of an Ethnicity concept is important to the study of health and can be measured reliably though not as reliably as we’d like. So we should try to support it somewhere in the CDM.

Here’s a start on documenting some relevant “assumptions and limitations and current utility of the current structure and vocabulary” as Patrick requested.

The Ethnicity definition that makes the most sense to me is as a demographic attribute that is distinct from and complementary to Race. More specifically, I suggest we define Ethnicity as an indicator of ethnocultural identity that a person ascribes to themselves. Ethnocultural identity refers to the extent to which an individual endorses and manifests the cultural traditions and practices of a particular group. Culture, in this sense, is the collection of historically defined beliefs, practices, and attitudes shared by a community.

Yes, it is messy. The relationship to language, religion, civic institutions, food choice, reading habits, recreational activities, etc. is complex and not absolute. The same is true, however, of the definition of some medical conditions. The definition of depression is also messy, for example. It doesn’t depend on fixed absolute characteristics that are free from subjective interpretation. Feeling worthless or guilty is one of about 9 symptoms that can contribute to the diagnosis of depression but it isn’t required. So one person gets the same Condition concept assigned to them for different reasons than another. Do we all understand feeling worthless or guilty in the same way? No we don’t. So the same symptom means different things to different people. It’s messy. Does that mean depression doesn’t have a clean enough definition to be reliably measured and used in reproducible research? No it doesn’t. The fabulous LEGEND studies demonstrate this clearly. It just means the concept definition is messy. We can and should find a way to deal with messy concepts when they are widely collected and can lead to important insights about health.

And Ethnicity as culture is important. It’s important because it’s associated with attitudes toward treatment seeking and other health behaviors and in some places it is a useful proxy for social factors that affect access to care and differences in how ostensibly equivalent care is delivered. It may affect my trust of medical institutions or my belief in effectiveness of “modern medicine”, or how I relate to providers or how they relate to me. etc…

The information we need to define ethnicity as a social determinant of health isn’t subsumed by a Race concept, even one with a suitably granular value set. A person of the same race, e.g. White, might or might not identify with an ethnicity e.g. Hispanic, for reasons that correlate or determine their health-related attitudes, behaviors, and access issues. That’s why it’s distinct from and complementary to Race.

It isn’t important whether or not ethnicity so defined might really be a proxy for wealth, education, or other factors. First of all we don’t know whether and to what extent that’s true in many cases. That is a topic for researchers to sort out rather than one we should prejudge, in my opinion. Second, we happily use a zillion concepts that are proxies for underlying causes. My blood sugar isn’t an invalid concept because it is a proxy for underlying glucometabolic processes.

I agree that our goal is standardization and that standardization implies a clear definition that can be consistently applied across different regions, times, individuals, etc… The definition of ethnicity as culture isn’t ever likely to be as clear and reliable as we want. Your French-Canadian example is a good one. The situation in Maine is interesting. French speakers in Maine were systematically discriminated against for generations - not allowed to hold management-level jobs, shamed for speaking French, etc. There are many communities in Maine that are still largely descendants of Francophone populations. They’re rural and not nearly as healthy as communities with non-Francophone descendants. If sites have the data to study related questions, do we want to prohibit them from trying because the concepts are messy? That seems hard to justify given the fact that the approaches in OHDSI have been successfully applied to other messy concepts like depression.

The OHDSI convention, as Clair brought up on the CDM WG call, is to treat self-reported health-relevant information as Observations and store it in the OBSERVATION rather than the PERSON table as a marker of the greater subjectivity or unreliability of the information. Since definitions of Ethnicity as culture are likely to vary across region and since they rely on largely inscrutable personal definitions, this seems like a reasonable solution to me. Though there are good behavioral scales for measuring ethnocultural identity, they aren’t widely used. The degree of inconsistency seems generally greater than for depression, in other words, and for the reasons described in the paper Patrick cites, warrants a different level of trust.

In addition to where we store ethnicity information, we might want to define conventions for how we represent ethnicity-as-culture. Hispanic/non-Hispanic is not granular enough to be very useful. There is an OBO-compliant Ethnicity Ontology (EO). After a quick scan, it seems pretty comprehensive. Maybe it’s worth a more detailed look to see if it meets Asiyah’s criteria for a well-defined ontology.

I apologize for the length of my reply, but you started it! :slight_smile:


I’d second @Patrick_Ryan’s assertion - we have some internal use cases as well.

As I remember arguing in an earlier thread, and as you state here, @Christian_Reich, just because race/ethnicity don’t have a solid, immutable biophysical grounding doesn’t mean that they don’t have a concrete impact on health outcomes, whether this is driven by underlying molecular variation (e.g. well-documented differences in diabetes phenotypic expression between people with East Asian vs. European ancestry), sociocultural factors (e.g. higher incidence of hypertension in Black Americans), or some combination of the two.

That said, I share your frustration with the lack of standardization in how this is documented. In a recent JAMIA paper, our group described the results of a natural language processing pipeline designed to extract race and ethnicity from free text notes. In short, we not only found that many patients from underserved populations didn’t have structured documentation of race/ethnicity, but also that patients from these populations who lacked structured data differed significantly from those who had it! In putting together this work, we relied heavily on the OMB definitions you mention above, but a lot of the free text documentation was very different - as you mention, people tend to conflate race and ethnicity, and a lot of Hispanic people don’t seem to agree with the gov’t idea that you can be Black Hispanic, White Hispanic, etc.

I continue to go back to the idea of fit-for-purpose. Just as there might be some different vocabularies in use in different countries, the distribution of race/ethnicity codes is going to be different between countries. For the US, OMB definitions are not perfect, but I think it’s best to just keep using them for US data sets and let other countries do whatever’s most appropriate for them. Even if we end up with 0 for race_concept_id and Hispanic for ethnicity_concept_id, that still lets us ask questions like “are Hispanic patients at higher risk of X?” Folks in other countries where, say, Sami is a more pertinent ethnicity than Hispanic can map how they see fit, and any analyses that depend on international data will have to take the differing mappings into account. This also, to me, ties in with the idea that the impact of race/ethnicity on health outcomes is going to be specific to the sociocultural context of race/ethnicity - impact of being Black in America on your health is going to be very different than the impact of being Black in Egypt or Brazil. I don’t think there’s any trickery we can do with vocab or mapping that will get around the local peculiarities of the relationship between race/ethnicity and health.

@esholle Your comment is so insightful. Where can I find your NLP result for the race and ethnicity?
@Andrew thanks for finding the EO (Ethnicity Ontology). I did a quick check and found that it is built to map the CTV3, READ2 and SNOMED codes used in the UK. Its ethnicities are more aligned with UK users than with US ones. With no textual definitions provided for each term, people may use it in different ways. But one can definitely learn from its classes and subclasses. The EO may be extended to accommodate other countries’ use. It is not a ready-to-use solution here.
I agree with @esholle: the health outcomes of being Black or African American are different from those of being Black in other countries.

Thanks for your kind words, @AsiyahFDA. I edited the post to include a link to the paper.

Is there any way we could use location/region concepts that were introduced to specify someone’s ancestry, culture, etc? Maybe thinking of these attributes as a regional one will also provide a mechanism to roll-up groups to higher-level aggregates.


I initially wrote another long exposé about everything that’s wrong with that research on race and health outcomes, particularly if you want to do it internationally, but then deleted it. Reason:

Let’s focus on that. I.e. before we go into solution space let’s define the problem space. Here is what I can glean from your comments:

  • Is “ethnicity … a proxy for wealth, education, or other factors”?
  • Are “descendants of Francophone population… not nearly as healthy as communities with non-Francophone descendants”?
  • Do “patients from underserved populations … have no structured documentation of race/ethnicity” and do “patients from these populations who lacked structured data differ significantly from those who had it”?
  • Is the “impact of being Black in America on your health … very different than the impact of being Black in Egypt or Brazil”?

Is that it? And from all I can see you are saying the current OMB thing works, right? Do we not have a problem? Because again, without a problem there is no need to work on a solution.

There probably is, but how many of our data sources have information that detailed about people’s ancestry? I don’t know much about claims data, but I can’t imagine there’s much there. And in US EHR data, I’m pretty sure no one’s distinguishing between where someone’s grandparents came from. Plus all of the health-relevant concepts, e.g. potential Tay-Sachs carrier, Hb/ss status, are already mapped in the vocab, no?

Some thoughts:

  • Is “ethnicity … a proxy for wealth, education, or other factors”?

Maybe, but probably not in a very interesting way, and probably not as good of a proxy as insurance status or neighborhood-level variables, which, for US pts anyway, we can get by geocoding addresses. Plus, we’d probably be more interested in looking at the role it plays after you control for wealth, education, etc. anyway.

Do “patients from underserved populations … have no structured documentation of race/ethnicity”

Obviously they don’t have no documentation, but I suspect they’re less likely to, (and, more broadly, that data quality issues disproportionately impact these communities). That said, we’d need a gold standard to be sure. Probably the next line of research after that JAMIA paper. I can also check to see if our NLP pipeline “filled in the blanks” as disproportionately Black/Hispanic for the NULLs/DECLINEDs/OTHERs. That said, this is another line of research, and one that seems to me to be only tangentially relevant to the question at hand: how do we handle the mapping of race/ethnicity for the OHDSI community.

  • do “patients from these populations who lacked structured data differ significantly from those who had it”?

At least at our institution, they definitely do - see above ref.

Is the “impact of being Black in America on your health … very different than the impact of being Black in Egypt or Brazil”?

This is the kind of question that militates for inclusion and good mapping of race/ethnicity in OHDSI! If we’re using the same terminology for these concepts across all of our databases, we can ask and answer questions like this.

And from all I can see you are saying the current OMB thing works, right? Do we not have a problem? Because again, without a problem there is no need to work on a solution.

I think I was saying that it works for US data sets, but that ultimately the way we do it is going to depend on the specific question we’re asking and the peculiarities of the source data. I know that the whole point of OHDSI is that we all pick one way to do it (the way that lets us ask the maximum amount of questions) but I think this is an area where we’re never going to get it 100% right. I think the best we can do is use OMB mapping for US data sets, try to coerce non-US data sets to the OMB mapping if we want to ask questions about race across internat’l borders (like the Brazil thread from a while ago), but in general, adhere to the fit-for-purpose standard.

So basically, I’m saying that I don’t think we have a problem. Patrick and Fernanda were able to publish their paper, we have internal use cases for using the data in its current format to assess disparities in quality of care, etc…I can see how it wouldn’t work for source data from other countries, but until/unless we have compelling use cases from those countries, why do anything? And if/when we do, I think you hit the nail on the head in your initial post:

keep both ETHNICITY_CONCEPT_ID and RACE_CONCEPT_ID, but we split off the lower half of the Race/Ethnicity Concept hierarchy to form a new Ethnicity Domain (including the hispanics). In that world, any ethnicity can be combined with any race.

This way people with Swedish data sets can map Sami as an ethnicity, and people with US data sets can map Hispanic/non-Hispanic.
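A rough sketch of that fit-for-purpose mapping, assuming each site keeps its own source-value lookup and falls back to 0 when nothing matches. The country codes, the `ETHNICITY_BY_COUNTRY` table, the `map_ethnicity` helper, and the Sami concept ID are illustrative assumptions (no such Sami concept is claimed to exist in the vocabulary today); the two OMB ethnicity IDs are the ones I believe Athena lists, so please verify them.

```python
from typing import Dict, Optional

# OMB ethnicity concepts as I understand them; check Athena before use.
OMB_ETHNICITY: Dict[str, int] = {
    "hispanic or latino": 38003563,
    "not hispanic or latino": 38003564,
}

# Hypothetical placeholder ID for a concept that would have to be added
# to a new Ethnicity Domain before Swedish sites could map to it.
SAMI_PLACEHOLDER = 2_000_000_201

# Each country maps only the ethnicities that are pertinent locally.
ETHNICITY_BY_COUNTRY: Dict[str, Dict[str, int]] = {
    "US": OMB_ETHNICITY,
    "SE": {"sami": SAMI_PLACEHOLDER},
}


def map_ethnicity(country: str, source_value: Optional[str]) -> int:
    """Return the ethnicity_concept_id for a source value, or 0 if unmapped."""
    if source_value is None:
        return 0
    lookup = ETHNICITY_BY_COUNTRY.get(country, {})
    return lookup.get(source_value.strip().lower(), 0)
```

Any analysis that pools sites then has to account for the fact that a 0 in one country and a 0 in another do not mean the same thing.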

You illustrated a use case of combining US and outside-US (OUS) data on race and ethnicities.
+1 from me.

@Christian_Reich I think the current standard has problems with the domain (Person vs Observation), definition (relationship to race), and granularity (currently limited to Hispanic / non-Hispanic) that prohibit investigation of the social determinants of health when data holders have better, more granular data. Those are the dimensions of the problem space for which the current standard is insufficient.

Investigation of the health impacts of self-identified ethnicity requires a more granular representation. @AsiyahFDA thanks for looking at the EO and assessing its relevance and readiness for use. Do you think it is worth trying to build from it?

@Andrew Regarding extending the EO, please see my assessment above: the EO is aligned to UK use. When an ontology provides no textual definitions, people may use it in different ways, which introduces heterogeneity.
If OHDSI wants to extend ethnic groups, given the international participation in OHDSI, it may be worth looking at SNOMED to see if anything is available for international use to start with.


Looks like @AsiyahFDA’s link is broken, but in Athena you can find them here. Not sure there is a useful hierarchy, but happy to switch over from OMB if folks find it more appropriate to what they are doing.


There are actually a whole ton of ethnicity-related ontologies listed in the bioportal: Search | NCBO BioPortal. Why don’t you check them out and tell us which one you like, and we have a path forward. @esholle, @Andrew, @andrea, @AsiyahFDA, @linikujp, @y7g2p, @roger.carlson, @SCYou, @Doc_Ed, @Vimala_Jacob? Anybody up for doing the homework so we can rest this subject (which will otherwise keep playing whack-a-mole every 6 months):


how about this one?


Like @linikujp said, it may be possible to just add my subgroups here.

@Andrea, @Christian_Reich,
I didn’t look through all the terms, but a lot of them do not have children terms for “ethnic group”. However, I found a big list of Chinese ethnic terms under NCIT’s “ethnic group”. Please check all the children terms under Ethnic Group - National Cancer Institute Thesaurus (NCIT)

The best is to bring this to the vocabulary.

And I think the way the race and ethnic group terms are arranged in NCIT deserves a good look from us. They may be reusable.

I have checked all the items. It has 56 ethnic groups, and each of them is one I require. Thanks @linikujp


Thank you for the discussion on this topic. We in the CDM Working Group would like to use this forum post to come to a decision, if possible, as to the best way to represent this data in the CDM. Some potential solutions that have been proposed:

  • Keep race in PERSON, move ethnicity to OBSERVATION
    • observation_concept_id is “ethnicity of person”, value_as_concept_id is actual ethnicity
  • Keep both in PERSON but update the ontology in the vocabulary - I believe the link sent out by @AsiyahFDA showing the SNOMED ethnicity is already in the vocabulary but the NCI ethnicities need to be added
  • Keep either race or ethnicity and remove the other to reduce confusion

Anything I missed?