
Dealing with multiple races and other exceptions

We are trying to adhere closely to OHDSI standards for the REDS study.

Coding a single race in person.race_concept_id is straightforward.

| id | code | name | class | standard | valid | domain | vocab |
|---|---|---|---|---|---|---|---|
| 8657 | 1 | American Indian or Alaska Native | Race | Standard | Valid | Race | Race |
| 8515 | 2 | Asian | Race | Standard | Valid | Race | Race |
| 8516 | 3 | Black or African American | Race | Standard | Valid | Race | Race |
| 8557 | 4 | Native Hawaiian or Other Pacific Islander | Race | Standard | Valid | Race | Race |
| 8527 | 5 | White | Race | Standard | Valid | Race | Race |

Where a person reports more than one race, we put each standard value into observation_concept_id, one observation record per race, and populate person.race_concept_id with “multiple race”.

However, to do so, it appears that we need to stray from the standard and use a PCORNet concept.

| id | code | name | class | standard | valid | domain | vocab |
|---|---|---|---|---|---|---|---|
| 44814659 | Race-06 | Multiple race | Race | Non-standard | Valid | Observation | PCORNet |
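The approach described above could be sketched roughly as follows. This is only an illustration: the helper name `code_race` and the dictionary row layout are hypothetical, the concept IDs are the ones listed in this post, and returning 0 when nothing was reported is an assumption based on the usual OMOP "no matching concept" convention.

```python
# Standard Race vocabulary concept IDs, keyed by source code (from the table above).
STANDARD_RACE = {
    "1": 8657,  # American Indian or Alaska Native
    "2": 8515,  # Asian
    "3": 8516,  # Black or African American
    "4": 8557,  # Native Hawaiian or Other Pacific Islander
    "5": 8527,  # White
}
MULTIPLE_RACE = 44814659  # PCORNet "Multiple race" (non-standard)


def code_race(person_id, reported_codes):
    """Return (person.race_concept_id, observation rows) for one person.

    Hypothetical helper: the row layout is an assumption, not a CDM specification.
    """
    concept_ids = sorted({STANDARD_RACE[code] for code in reported_codes})
    if not concept_ids:
        # Assumption: nothing reported is coded as concept 0.
        return 0, []
    if len(concept_ids) == 1:
        # A single race goes straight into person.race_concept_id.
        return concept_ids[0], []
    # More than one race: one observation row per race, person gets "Multiple race".
    rows = [
        {"person_id": person_id, "observation_concept_id": cid}
        for cid in concept_ids
    ]
    return MULTIPLE_RACE, rows


race_concept_id, observation_rows = code_race(1, ["2", "3"])
```

Here a person reporting Asian and Black or African American ends up with the non-standard 44814659 in person and two observation rows, which is exactly the departure from standard concepts that prompts the question.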

Likewise, for a person who declined to answer, we could populate person.race_concept_id with another non-standard PCORNet concept from the Observation domain, or a standard LOINC concept from the Meas Value domain.

| id | code | name | class | standard | valid | domain | vocab |
|---|---|---|---|---|---|---|---|
| 44814660 | Race-07 | Refuse to answer | Race | Non-standard | Valid | Observation | PCORNet |
| 36210418 | LA27922-6 | Decline to answer | Answer | Standard | Valid | Meas Value | LOINC |

Finally, when a subject responds with “other” or “unknown”, it appears that there is no valid code that we could use.

| id | code | name | class | standard | valid | domain | vocab |
|---|---|---|---|---|---|---|---|
| 8552 | UNK | Unknown | Race | Non-standard | Invalid | Race | Race |
| 8522 | 9 | Other Race | Race | Non-standard | Invalid | Race | Race |

Q1) Is there a consensus on how RACE should be dealt with, to allow for such data?

Q2) Would it be reasonable and practical for OMOP’s RACE vocabulary to include additional options?


Thank you for your question, @Pulver. The subject of Race and Ethnicity has been a topic of conversation for a while, as can be seen in the forum thread here. There is a proposal to update OMOP Race and Ethnicity here, which states that non-standard concepts such as the ‘other race’ you mentioned are mapped to 0, and which also proposes adding concepts such as ‘multiple’. As of now, this proposal has not been implemented. I hope these links will help you further with your question.



Should we get this nailed? As @janice said we have been noodling this for a long time.

Summary of the solution:

  1. We leave ethnicity_concept_id and race_concept_id intact.
  2. We create one combined ethnicity/race vocabulary that is a union of all existing lists we have right now.
  3. Anybody can contribute additional concepts and map them to existing ones as they see fit.
  4. The Vocab team attempts no deduping, as it is not feasible given the lack of objective definitions.

For the ETLers: Put the concept where you think it is more appropriate.
For the analysts: Create long concept sets of quasi-equivalent concepts for cohort definitions, and find them in either ethnicity_concept_id or race_concept_id.
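For the analyst side, a rough sketch of what "find them in either field" might look like, using Python's built-in sqlite3 and a toy `person` table. The table contents and the helper `persons_in_concept_set` are hypothetical, and the one-member concept set (8515, Asian) is only a stand-in for the long sets of quasi-equivalent concepts proposed here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person ("
    "person_id INTEGER PRIMARY KEY, "
    "race_concept_id INTEGER, "
    "ethnicity_concept_id INTEGER)"
)
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [
        (1, 8515, 0),  # concept recorded as race
        (2, 0, 8515),  # same concept placed in ethnicity by a different ETL
        (3, 8527, 0),  # not in the concept set
    ],
)


def persons_in_concept_set(conn, concept_ids):
    """Return person_ids whose race OR ethnicity falls in the concept set."""
    ids = list(concept_ids)
    placeholders = ",".join("?" for _ in ids)
    sql = (
        "SELECT person_id FROM person "
        f"WHERE race_concept_id IN ({placeholders}) "
        f"OR ethnicity_concept_id IN ({placeholders}) "
        "ORDER BY person_id"
    )
    # The ID list is bound twice, once for each IN clause.
    return [row[0] for row in conn.execute(sql, ids + ids)]


matches = persons_in_concept_set(conn, [8515])
```

Persons 1 and 2 are both returned even though their ETLs disagreed about which field the concept belongs in, at the cost of every cohort definition having to check both columns.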



From a standpoint of pure logic, Unknown:8552 is the same as putting in 0; there is no need for it.

On the other hand, Refused to answer:44814660 does tell us something positive about the patient; this might be a confounding data point in some very specific studies. My only concern is whether ETL’ers will start using it just to get their numbers up. This is more an AOU concern, where there is a push to put in data that is questionable at best.


This doesn’t make any sense. If we leave the two fields intact as per #1, why are we combining the domains in the Concept table as seen in point #2?

Bad idea. We strive to have clear, concise, unambiguous conventions for ETLers. No guessing, the domain_id of the standard concept tells the data where to go in the CDM.

And what’s the end user to do? Query both fields to retrieve ethnicity-race? I also don’t think this is a good idea. As you have said many times, the CDM is analyst ready.

There is an open Themis issue here, but it lacks a sponsor. As an open source community, we rely on community members to contribute and drive the evolution of the standards, methods and research. The Themis process is found on the Themis GitHub home page here. Who would like to sponsor this topic?

Thank you all for your responses.

The only use case that I imagine to justify having “Unknown:8552” as a distinct value for person.race is the extremely rare situation of a person who asserts that they do not know their race. This is not quite the same as “refuse/decline” to answer.

I agree.

Regarding “Other”, I am at a loss as to why someone would select this other than if their concept of “race” differs from the “standard” that we use. They might conflate race with ethnicity, identifying as “Ashkenazi Jew” or “Japanese” rather than “White” or “Asian” respectively. Perhaps this too should be differentiated from not answering the question. It might suggest to an analyst that a look at person.person_source_value is in order. Personally, in cases like these, I would just map to “White” and “Asian”.

As to person.ethnicity: having ethnicity be a binary choice between Hispanic and non-Hispanic seems extremely USA-centric. While some vocabularies are specific to a country, shouldn’t Race and Ethnicity be geographically-agnostic?

My RADICAL (perhaps heretical) PROPOSAL

My suggestion is to address the issue of multiple races and other problems by removing race_concept_id and ethnicity_concept_id entirely, treating race and ethnicity as observations. This would allow a person to be coded with any number of races and ethnicities simultaneously, and eliminate a structural need to distinguish between those two concepts.

If in the US there’s a need to distinguish between a White Hispanic person and a Black Hispanic person, no problem. Likewise biracial or triracial. If a Belgian study depends on distinguishing Flemish from Walloon, it’s covered. For research in South Africa, should differentiating cohorts by tribe be important, Nguni, Sotho, Shangaan-Tsonga and Venda are easily accommodated. Need greater granularity in RSA? Include Zulu, Xhosa, Ndebele and Swazi.

Likewise for a study comparing Anglophones, Francophones, and Hispanophones.

The idea is to replace Race and Ethnicity languages/domains with a new domain: Social Characterization, which could similarly address languages, gender identity, etc.

Complete ability to customize. For a study combining USA, Belgian and South African populations which needed to differentiate by “traditional” race values, it would be a simple matter to map Flemish to White and Swazi to Black.

Yes, analysis would require more complicated queries than the current approach. The question is: Would the greater flexibility be worth it?
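To give a feel for the extra query complexity, here is a minimal sketch using Python's built-in sqlite3 and a toy `observation` table. It assumes race is stored one row per person per concept in observation_concept_id; the sample rows and the helper `persons_with_all` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE observation ("
    "person_id INTEGER, observation_concept_id INTEGER)"
)
conn.executemany(
    "INSERT INTO observation VALUES (?, ?)",
    [
        (1, 8515), (1, 8516),  # person 1: Asian and Black or African American
        (2, 8527),             # person 2: White only
        (3, 8516),             # person 3: Black or African American only
    ],
)


def persons_with_all(conn, concept_ids):
    """Return person_ids that have an observation row for EVERY given concept."""
    ids = sorted(set(concept_ids))
    placeholders = ",".join("?" for _ in ids)
    sql = (
        "SELECT person_id FROM observation "
        f"WHERE observation_concept_id IN ({placeholders}) "
        "GROUP BY person_id "
        "HAVING COUNT(DISTINCT observation_concept_id) = ? "
        "ORDER BY person_id"
    )
    return [row[0] for row in conn.execute(sql, ids + [len(ids)])]


asian_and_black = persons_with_all(conn, [8515, 8516])
any_black = persons_with_all(conn, [8516])
```

With a single race_concept_id column the second lookup is a plain equality filter; with observation rows, even "has both of these" needs a GROUP BY and HAVING, which is the trade-off being weighed.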

[Note: I have no idea whether health studies differentiate by the categories that I mention or if doing so would be of value.]

What does it tell us?

@Mark is right. “Unknown”, “Other”, “Don’t want to tell”, “Bugger off” and all those other ways of telling us we won’t have the fact are all flavors of null. And that is where they should be mapped to. I know they are very popular. I have never seen a use case for them.

I know. I proposed to combine the fields into race_ethnicity_concept_id, since we cannot separate them. But @Jake had the strong opinion that folks may have a preference whether something is more of an ethnicity or a race, even though there are overlaps. But feel free to bring it on again.

You saying that is music to my ears. :slight_smile: So, bring it on.

I could be the sponsor, since I have been pushing it more than once. It’s time to bring it home.

Yes, this is the idea, @Pulver. Except not to make it an observation, but a demographic field. Why? Two reasons:

  1. Everybody expects it there. Not a strong reason, though.
  2. We want the ETL to figure out contradictions and multiple records and deliver one fact. In contrast to gender, race/ethnicity seems to be a lifelong stable attribute.

Are you suggesting multiple demographic records or several social_characteristic fields? If not, then how would multiple characteristics, such as those of a biracial Hispanic person, be recorded?

By using Observation records there is no limit to the number of attributes that could be assigned to a person. It also accommodates characteristics of local significance in places other than the USA. Isn’t OMOP intended to be usable everywhere?

So, your point is we need more than one per person? Why can’t we have pre-coordinated concepts like in the other domains?

An observation is something non-static, something that is observed at a point in time. Ethnicity, race, sex, birthdate, death date: all of those are static demographic facts.


If I am a valid representation of that group of people who refuse to answer personal questions, said responder is going to do their own research and, should it disagree with what the provider told them, they will ignore it. :wink:

In other words, if it is a study that requires that the patient has followed the instructions to a T, then perhaps said responder should be removed from the study. Of course, this would not be true for all studies.

I see. But observational studies usually are retrospective. So, them not wanting to say what they are means nothing with respect to the study, because it wasn’t even undertaken then. Also, the patients don’t get contacted.

As far as observational data are concerned, all we know is that we don’t know. Why we don’t know is irrelevant. And a fact that is not known will not be recorded, or, in demographics or other mandatory fields, will be concept_id=0.
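A minimal sketch of that convention, assuming a simple lookup from source strings to the standard Race concepts listed at the top of the thread. The helper `map_race` and the exact source strings are hypothetical; the one firm rule taken from this post is that anything that is not a known fact ends up as concept_id 0, with the original answer kept only in the source value field.

```python
# Standard Race concepts (IDs from the first table in this thread),
# keyed by a normalized, hypothetical source string.
STANDARD_RACE = {
    "american indian or alaska native": 8657,
    "asian": 8515,
    "black or african american": 8516,
    "native hawaiian or other pacific islander": 8557,
    "white": 8527,
}


def map_race(source_value):
    """Map a raw source answer to race_concept_id, with 0 for every flavor of null."""
    key = source_value.strip().lower()
    return {
        # "Unknown", "Other", "Refused" and similar all fall through to 0.
        "race_concept_id": STANDARD_RACE.get(key, 0),
        # The verbatim answer is preserved so nothing is lost at the source level.
        "race_source_value": source_value,
    }


examples = [map_race(v) for v in ("Asian", "Unknown", "Refused", "Other")]
```

Only the first example maps to a standard concept; the other three are indistinguishable by concept_id and differ only in race_source_value, which is the point of contention in the replies that follow.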


I think this is where my past experience in writing engineering software causes me to see the world from a different viewpoint. In an engineering module, why one does not know something may cause a different heuristic to be used to interpolate/extrapolate the missing data.



I agree with you: if you need to impute missing data, anything you can hang your hat on would help. For the ETLer that may be something to think about. But for the analyst? That poor wretch has no chance to tweak individual patients. Phenotyping is hard enough the way it is using the data at face value.

If you have a use case for using this type of information in an analysis, those flavors of null could be back in business.

We deal with longitudinal observational studies, utilizing both historic and contemporaneously obtained data spanning multiple years.

People occasionally change how they self-identify their race/ethnicity.
Potential cases include a person:

  • of African ancestry may decide that they prefer to be seen as “bi-racial” or “black”.
  • who, owing to family lore, claimed to be part Cherokee, finds out that it was not true.
  • adopted by a “white” family as an infant, who might learn as an adult that one of their biological parents was half Navajo.

It could be of significance to a socio-medical researcher that for some part of the study period, the person believed themselves to have American Indian blood.
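One way to picture this is as dated self-reports from which the answer in effect on a given date can be recovered. The sketch below is only illustrative: the tuple layout, the sample dates, and the helper `race_as_of` are hypothetical, and returning 0 when nothing has been reported yet is an assumption.

```python
from datetime import date

# Hypothetical dated self-reports: (person_id, report_date, concept_id).
reports = [
    (1, date(2015, 3, 1), 8516),       # identifies as Black or African American
    (1, date(2021, 6, 15), 44814659),  # later identifies as multiple race
]


def race_as_of(reports, person_id, as_of):
    """Return the most recent self-reported concept on or before `as_of`, else 0."""
    prior = [
        (report_date, concept_id)
        for pid, report_date, concept_id in reports
        if pid == person_id and report_date <= as_of
    ]
    if not prior:
        return 0  # assumption: nothing reported yet is treated as concept 0
    # Pick the report with the latest date.
    return max(prior, key=lambda item: item[0])[1]


during_study = race_as_of(reports, 1, date(2018, 1, 1))
at_end = race_as_of(reports, 1, date(2022, 1, 1))
```

A single race_concept_id on the person record can hold only one of these two answers, whereas dated records let a researcher ask what the person believed at any point in the study period.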

As race and ethnicity are self-reported social constructs, not “facts”, that vary widely about the world, I believe that we need to maximize our flexibility. How else would we be able to run a study with sites scattered between Australia, New Zealand, and the US?

While our theoretical partners in New Zealand would seem unlikely to have an interest in whether a subject is Hispanic, being able to tease out the Maori population could be critical. Likewise Aboriginal citizens in Australia.

By the way, what standard “Race” would you expect Australian Aborigines to check off on a survey instrument?

Though there are SNOMED codes for Australian Aborigines and New Zealand Māori, they are neither in the RACE vocabulary nor Standard concepts. How would you code an Australian asserting mixed heritage, one parent being Aboriginal, the other being of British (white) descent?

I disagree! In conducting research into quality of care, I would want to differentiate between subjects who refused to answer a question, didn’t know the answer, and selected an answer for which there is no standard concept. Similarly, it can be very important to distinguish a patient to whom a question was not posed or for whom a metric was not assessed.

I have been following this, as Gerry (@Pulver) posted it for a project for which I am the site PI.

First, I want to be sure we are aware that a large percentage of people in the USA are born as one race/ethnicity and die as another. These are dynamic constructs, like gender nowadays, and people change their reporting of them over time. I am not sure tracking this is critical, but the idea that they are static is not correct, so the argument about Observation vs. demographic isn’t quite right.

Second, diversity, equity and inclusivity are huge issues these days to both our patients and the institutions I work with. Many institutions are very careful about how they capture these data. When you curate data for these institutions, they do not like to lose the granularity in their data. I just did a stratified sample for a research project where it was critical to stratify by the types of dual- and tri-racial people in the population.

Third, remember these are meant to be self-reported data: we as providers or data managers are not supposed to be changing what is self-reported. That is the US Census model. While we are familiar with the 5 race and 2 ethnicity roll-up, even the US Census does sampling that includes detailed family information that is part of the ethnicity area if you happen to get the long sample.

Finally, as @Pulver mentioned, as soon as you move outside the US these categories blow up. It seems like an internationally acceptable approach should be sought. As noted above, if you are from Australia and have been told all your life that your race is Aborigine, which is a race in Australia, and come to the US, it would make perfect sense for that person to answer “other”. Analytically you may roll them all up, but making data decisions to decrease granularity, instead of analytical decisions about when to roll things together, doesn’t seem to me to be the best overall approach.



All good.

But we have a system today. If you want to change it, we need:

  1. The analytic use case. Having the data is not a use case. Surely there must be one, if the institutions go through the trouble and collect the information.
  2. A proposal for that use case that delineates how the use case is fulfilled (and the current one does not do the job). And yes, it needs to work globally.
  3. A debate to decide amongst the existing solution, your proposal and the other proposals.

Makes sense?

“Other” as opposed to which “one”? This does not work. We run a Closed World system. We know what we know (what’s in the record), and what is not in the record is not there. If you have an Australian Aborigine you declare it. There is no such research question asking “tell me the outcome XYZ on all patients with the race ‘other’!”. The answer, if there were one, would have little meaning since it would not be interpretable.

Of course, since it is a social more than a biological entity, in the US “Australian Aborigine” does not have the meaning and implication it has in Australia, and any outcomes research will not jibe between these two countries. One solution could be that we define ethnicity and racial value sets specific to a geography. But again, you propose, I am just facilitating.


Perhaps I would better understand your perspective if you explained, using the current schema, how you would populate person.race_concept_id and ethnicity_concept_id in a longitudinal database, using domain-consistent standard concepts, keeping in mind that, as you can’t foresee the questions that future users will address, you wish to minimize loss of intelligence.

Situations we may face on large-scale USA-based studies:

  1. Bi-racial person identifies as Asian and African American?

  2. Person identifies as Spanish?

  3. Person identifies as Chamorra?

  4. Person identifies as Chicano?

  5. Person says “I don’t know”?

  6. Person explicitly refuses/declines to answer?

  7. Person writes in “Australian Aborigine”?

  8. Person identifies as African American early in the study, but years later identifies as bi-racial?

Hypothetical cases that I expect our colleagues overseas may be likely to encounter:

  1. Bi-racial person identifies as Australian Aborigine and English?

  2. Person identifies as Spanish?

  3. Person identifies as New Zealand Maori?

  4. Person identifies as Zulu?

  5. Person identifies as Han?

How, if at all, would you code the five “overseas” scenarios differently if you were in Europe, Australia, Asia, or Africa?

[It could be helpful if folks from outside the US joined this discussion!]

