
Race and Ethnicity in the OMOP CDM


I initially wrote another long exposé about everything that’s wrong with that research on race and health outcomes, particularly if you want to do it internationally, but then deleted it. Reason:

Let’s focus on that. I.e. before we go into solution space let’s define the problem space. Here is what I can glean from your comments:

  • Is “ethnicity … a proxy for wealth, education, or other factors”?
  • Are “descendants of Francophone population… not nearly as healthy as communities with non-Francophone descendants”?
  • Do “patients from underserved populations … have no structured documentation of race/ethnicity” and do “patients from these populations who lacked structured data differ significantly from those who had it”?
  • Is the “impact of being Black in America on your health … very different than the impact of being Black in Egypt or Brazil”?

Is that it? And from all I can see you are saying the current OMB thing works, right? Do we not have a problem? Because again, without a problem there is no need to work on a solution.

There probably is, but how many of our data sources have information that detailed about people’s ancestry? I don’t know much about claims data, but I can’t imagine there’s much there. And in US EHR data, I’m pretty sure no one’s distinguishing between where someone’s grandparents came from. Plus all of the health-relevant concepts, e.g. potential Tay-Sachs carrier, Hb/ss status, are already mapped in the vocab, no?

Some thoughts:

  • Is “ethnicity … a proxy for wealth, education, or other factors”?

Maybe, but probably not in a very interesting way, and probably not as good of a proxy as insurance status or neighborhood-level variables, which, for US pts anyway, we can get by geocoding addresses. Plus, we’d probably be more interested in looking at the role it plays after you control for wealth, education, etc. anyway.

Do “patients from underserved populations … have no structured documentation of race/ethnicity”

Obviously it’s not that they have no documentation at all, but I suspect they’re less likely to have it (and, more broadly, that data quality issues disproportionately impact these communities). That said, we’d need a gold standard to be sure. Probably the next line of research after that JAMIA paper. I can also check to see if our NLP pipeline “filled in the blanks” as disproportionately Black/Hispanic for the NULLs/DECLINEDs/OTHERs. That said, this is another line of research, and one that seems to me to be only tangentially relevant to the question at hand: how do we handle the mapping of race/ethnicity for the OHDSI community?

  • do “patients from these populations who lacked structured data differ significantly from those who had it”?

At least at our institution, they definitely do - see above ref.

Is the “impact of being Black in America on your health … very different than the impact of being Black in Egypt or Brazil”?

This is the kind of question that militates for inclusion and good mapping of race/ethnicity in OHDSI! If we’re using the same terminology for these concepts across all of our databases, we can ask and answer questions like this.

And from all I can see you are saying the current OMB thing works, right? Do we not have a problem? Because again, without a problem there is no need to work on a solution.

I think I was saying that it works for US data sets, but that ultimately the way we do it is going to depend on the specific question we’re asking and the peculiarities of the source data. I know that the whole point of OHDSI is that we all pick one way to do it (the way that lets us ask the maximum number of questions), but I think this is an area where we’re never going to get it 100% right. I think the best we can do is use OMB mapping for US data sets, try to coerce non-US data sets to the OMB mapping if we want to ask questions about race across international borders (like the Brazil thread from a while ago), but in general, adhere to the fit-for-purpose standard.

So basically, I’m saying that I don’t think we have a problem. Patrick and Fernanda were able to publish their paper, we have internal use cases for using the data in its current format to assess disparities in quality of care, etc…I can see how it wouldn’t work for source data from other countries, but until/unless we have compelling use cases from those countries, why do anything? And if/when we do, I think you hit the nail on the head in your initial post:

keep both ETHNICITY_CONCEPT_ID and RACE_CONCEPT_ID, but we split off the lower half of the Race/Ethnicity Concept hierarchy to form a new Ethnicity Domain (including the Hispanics). In that world, any ethnicity can be combined with any race.

This way people with Swedish data sets can map Sami as an ethnicity, and people with US data sets can map Hispanic/non-Hispanic.

You illustrated a use case of combining US and OUS data on race and ethnicity.
+1 from me.

@Christian_Reich I think the current standard has problems with the domain (Person vs Observation) definition (relationship to race) and granularity (currently limited to Hispanic / non-Hispanic) that prohibit investigation of the social determinants of health when data holders have better, more granular data… Those are the dimensions of the problem space that the current standard is insufficient for…

Investigation of the health impacts of self-identified ethnicity requires a more granular representation. @AsiyahFDA thanks for looking at the EO and assessing its relevance and readiness for use. Do you think it is worth trying to build from it?

@Andrew Regarding extending the EO, please see my assessment above: the EO is aligned to UK use. When an ontology does not provide textual definitions, people may use its terms in different ways and thus introduce heterogeneity.
If OHDSI wants to extend the ethnic groups, given OHDSI’s international participation, it may be worth looking at SNOMED to see if anything is available for international use to start with.


Looks like @AsiyahFDA’s link is broken, but in Athena you can find them here. Not sure there is a useful hierarchy, but happy to switch over from OMB if folks find it more appropriate to what they are doing.


There are actually a whole ton of ethnicity-related ontologies listed in the bioportal: Search | NCBO BioPortal. Why don’t you check them out and tell us which one you like, and we have a path forward. @esholle, @Andrew, @andrea, @AsiyahFDA, @linikujp, @y7g2p, @roger.carlson, @SCYou, @Doc_Ed, @Vimala_Jacob? Anybody up for doing the homework so we can put this subject to rest (which will otherwise keep playing whack-a-mole every 6 months):


how about this one?


Like @linikujp said, it may be possible to just add my subgroups here.

@Andrea, @Christian_Reich,
I didn’t look through all the terms, but a lot of them do not have children terms for “ethnic group”. However, I found a big list of Chinese ethnic terms under NCIT’s “ethnic group”. Please check all the children terms under Ethnic Group - National Cancer Institute Thesaurus (NCIT).

The best approach is to bring this into the vocabulary.

And I think we need to take a good look at how the race and ethnic group terms are arranged in NCIT. They may be reusable.

I have checked all the items. It has 56 ethnic groups, and they are all ones I need. Thanks @linikujp


Thank you for the discussion on this topic. We in CDM working group would like to use this forum post to come to a decision, if possible, as to the best way to represent this data in the CDM. Some potential solutions that have been proposed:

  • Keep race in PERSON, move ethnicity to OBSERVATION
    • observation_concept_id is “ethnicity of person”, value_as_concept_id is actual ethnicity
  • Keep both in PERSON but update the ontology in the vocabulary - I believe the link sent out by @AsiyahFDA showing the SNOMED ethnicity is already in the vocabulary but the NCI ethnicities need to be added
  • Keep either race or ethnicity and remove the other to reduce confusion

Anything I missed?

@clairblacketer you may want to add the NCIT resource from the conversation above.

Hi @linikujp you’re right, we would need to add these. I’ll add that to my list.

Hi, has there been any solution to this issue? In addition to the ontologies shared by @Christian_Reich, there are other consortia working on this very same issue. For example, the ClinGen consortium has a specific working group: Ancestry and Diversity - ClinGen | Clinical Genome Resource. I am wondering if OHDSI should start a group devoted only to this issue. I think we should consider not only race and ethnicity, but also ancestry, religion, nationality, etc.

Is there a reason that race and ethnicity are in the person table and not in the observation table? I ask because: 1/ these are usually patient-reported, and the patient may be asked on admission even if they have a record already. 2/ people are increasingly multi-racial and multi-ethnic, which is obvious to anyone who has ever seen a 23andMe result. 3/ we see that a number of patients will change their ethnicity at different visits. We are still exploring why that is, but the going hypothesis is that multi-ethnic patients will use whatever they see as the most beneficial for them for that visit. If they get admitted through emergency, minorities might not want to be slowed by biased triage and declare themselves as white, but when admitted to a ward directly they may choose to declare their minority group to get more financial aid. 4/ similarly, there may be insurance fraud, where the patient has no legitimate claim to the ethnicity they are stating.

If these were observations I would feel more comfortable using non-standard concepts.

Would it violate any CDM rules to leave the ethnicity and race as 0 and create observations? Instead of 0 we could also put the first observed ethnicity and race in the person table and still add the observations.
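To make the second option concrete, here is a minimal Python sketch: every self-report is kept, the earliest one is what would go into the person table, and 0 is used when nothing was recorded. The `Report` shape, the function names, and the concept IDs in the usage lines are assumptions for illustration only, not an agreed OHDSI convention.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Report:
    """One self-reported value captured at a visit (hypothetical shape)."""
    person_id: int
    report_date: date
    value_concept_id: int  # concept for the reported race or ethnicity


def person_level_value(reports):
    """Return the earliest reported concept, or 0 when nothing was recorded."""
    if not reports:
        return 0
    return min(reports, key=lambda r: r.report_date).value_concept_id


def has_conflicting_reports(reports):
    """True when a person reported more than one distinct value across visits."""
    return len({r.value_concept_id for r in reports}) > 1


# Illustrative ethnicity concept IDs; verify them in Athena before use.
HISPANIC = 38003563
NOT_HISPANIC = 38003564

reports = [
    Report(1, date(2020, 5, 1), NOT_HISPANIC),
    Report(1, date(2019, 3, 9), HISPANIC),
]
first_value = person_level_value(reports)
conflict = has_conflicting_reports(reports)
```

The `has_conflicting_reports` check is one way to flag the patients described above who state different ethnicities at different visits.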

We have client data where patients have more than one race. What we did was put one of the values into the Person table and the rest of the race values into the Observation table. This does not violate OMOP rules, as there are standard observation concepts for race and ethnicity. I am listing them below:

  • 4013886 Race
  • 44803968 Ethnicity

These concepts are loaded into observation_concept_id column and the actual race / ethnicity values are put into value_as_concept_id fields in Observation table.
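A rough Python sketch of that split, assuming the source delivers a list of race concept IDs per patient. The function name, the dictionary row shape, the default `type_concept_id` of 0, and the race concept IDs in the usage line are placeholders for illustration; only 4013886 comes from the list above.

```python
from datetime import date

RACE_OBSERVATION_CONCEPT_ID = 4013886  # "Race", as listed in the post above


def split_races(person_id, race_concept_ids, observation_date, type_concept_id=0):
    """Return (race_concept_id for PERSON, list of OBSERVATION row dicts).

    type_concept_id is a placeholder; use the provenance concept your ETL prefers.
    """
    # Drop duplicates while preserving the source order.
    unique = list(dict.fromkeys(race_concept_ids))
    if not unique:
        return 0, []
    primary, extras = unique[0], unique[1:]
    rows = [
        {
            "person_id": person_id,
            "observation_concept_id": RACE_OBSERVATION_CONCEPT_ID,
            "observation_date": observation_date,
            "observation_type_concept_id": type_concept_id,
            "value_as_concept_id": race,
        }
        for race in extras
    ]
    return primary, rows


# Illustrative race concept IDs; confirm them in Athena.
primary, rows = split_races(42, [8527, 8515, 8527], date(2021, 1, 15))
```

Which value ends up in the Person table is a design choice; here it is simply the first one in source order.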


@QI_omop: We need to ratify this, by the way. We need to tell @clairblacketer.

The whole thing violates the OMOP idea: Creating a standard that everybody adheres to, so that data no longer need the context in which they were generated to be correctly analyzed. This standard should be objectively defined. Race and ethnicity however are not objectively definable. Worse, they are self-assigned, which means they are not even defined within a data asset.

So, I think the solution is what you guys laid out: Standard simple self-assignment in PERSON, and details (3/7th of an Inuit) into the OBSERVATION table. We do a similar distinction between crude and detailed with Location.
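On the read side, an analysis would then combine the crude value from PERSON with whatever detail sits in OBSERVATION. Below is a small sketch using Python's built-in `sqlite3` with two cut-down tables; the reduced column set, the sample rows, and the `race_profile` helper are assumptions for illustration, not the full CDM schema.

```python
import sqlite3

RACE_OBSERVATION_CONCEPT_ID = 4013886  # "Race" observation concept from this thread

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (person_id INTEGER PRIMARY KEY, race_concept_id INTEGER);
CREATE TABLE observation (
    person_id INTEGER,
    observation_concept_id INTEGER,
    value_as_concept_id INTEGER
);
""")
# Illustrative rows: person 1 has a crude value plus two detailed ones,
# person 2 has nothing recorded (race_concept_id = 0).
conn.execute("INSERT INTO person VALUES (1, 8527)")
conn.execute("INSERT INTO person VALUES (2, 0)")
conn.executemany(
    "INSERT INTO observation VALUES (?, ?, ?)",
    [(1, RACE_OBSERVATION_CONCEPT_ID, 8515), (1, RACE_OBSERVATION_CONCEPT_ID, 8657)],
)


def race_profile(conn, person_id):
    """Return the crude PERSON value plus detailed values from OBSERVATION."""
    crude = conn.execute(
        "SELECT race_concept_id FROM person WHERE person_id = ?", (person_id,)
    ).fetchone()[0]
    detailed = [
        row[0]
        for row in conn.execute(
            "SELECT value_as_concept_id FROM observation "
            "WHERE person_id = ? AND observation_concept_id = ? "
            "ORDER BY value_as_concept_id",
            (person_id, RACE_OBSERVATION_CONCEPT_ID),
        )
    ]
    return crude, detailed


profile = race_profile(conn, 1)
```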

Has this been established? Where can we go to learn the current standard?

We have just begun to OMOP our data at my institution, and the race/ethnicity problem immediately confounded us. We have mapped our data (we are in the US) to the CDC standard codes. Are the CDC codes represented in the OMOP standard vocabulary? If this is not the forum for these questions, please direct me to the appropriate place.

Thank you!

There isn’t an established convention for adding more than one race or ethnicity to the CDM.

I will mark this thread as a Themis issue and create an issue in the Themis GitHub. Stay tuned!

Edit to add link to GitHub issue.