
Multiple race solution?

Our EHR data has more than one column for race per person. We have requests for data on all races (the U.S. gov’t defined 5 races) a person claims. Some of these requests are associated with conditions, some are associated with procedures, and at least one also includes a social variable. How are others handling this issue?


What’s the use case? If somebody has several races, I assume it’s a mixed race, correct? Can you do outcomes research on mixed race?

This was addressed recently with THEMIS.

When a person has multiple races it is recommended to choose the latest race found for the patient record, unless a more appropriate approach exists for your data. If you would like to keep other designations of race, do so in the OBSERVATION table.

Add under conventions on the PERSON wiki page.
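A minimal sketch of that convention, for anyone who wants to see it spelled out. It assumes a hypothetical staging table `source_race(person_id, race_concept_id, recorded_date)`, which is not part of the CDM, and uses Python's built-in `sqlite3` only so the SQL can be run as is. The concept IDs in the usage note are illustrative values. The latest race per person goes to PERSON and every other distinct race is kept as a candidate OBSERVATION row.

```python
import sqlite3


def split_races(rows):
    """Split source race rows into a PERSON pick and OBSERVATION leftovers.

    rows: (person_id, race_concept_id, recorded_date) tuples from a
    hypothetical staging table; dates are ISO strings so they sort correctly.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE source_race "
        "(person_id INT, race_concept_id INT, recorded_date TEXT)"
    )
    con.executemany("INSERT INTO source_race VALUES (?, ?, ?)", rows)

    # Collapse repeats of the same race, then rank each person's races
    # from most recently recorded to least recently recorded.
    ranked = """
        SELECT person_id, race_concept_id, last_recorded,
               ROW_NUMBER() OVER (
                   PARTITION BY person_id
                   ORDER BY last_recorded DESC, race_concept_id
               ) AS rn
        FROM (
            SELECT person_id, race_concept_id,
                   MAX(recorded_date) AS last_recorded
            FROM source_race
            GROUP BY person_id, race_concept_id
        )
    """
    # rn = 1 is the latest race: this is what would populate PERSON.
    person = con.execute(
        f"SELECT person_id, race_concept_id FROM ({ranked}) "
        "WHERE rn = 1 ORDER BY person_id"
    ).fetchall()
    # Everything else is retained for the OBSERVATION table.
    observation = con.execute(
        f"SELECT person_id, race_concept_id, last_recorded FROM ({ranked}) "
        "WHERE rn > 1 ORDER BY person_id, last_recorded"
    ).fetchall()
    con.close()
    return person, observation
```

Calling `split_races([(1, 8527, "2015-01-01"), (1, 8516, "2019-06-01"), (2, 8515, "2018-03-03")])` gives person 1 the race recorded in 2019 and leaves the 2015 value for OBSERVATION. Ties on date are broken by concept ID here, which is an arbitrary choice.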

@ericaVoss I like the idea of putting the other races in the OBSERVATION table, but what do we do with the observation_date field? Do we put the DOB or the date we run the ETL…?

For facts “decided at inception” (race), using DOB is what we have done with other data.

It is a fact that typically does not change over time (could be “clarified” later) but biology should not change.

Think of what the code grabbing this info would do - it would grab the latest instance of that concept_id. (or reason a bit if there are multiple rows).

Would you have a date stamp associated to when the race was listed? I would think you would just use that. Otherwise I would probably just say MIN(OBSERVATION_PERIOD_START_DATE).

If you use YEAR_OF_BIRTH you’d end up outside of the OBSERVATION_PERIODS, which is fine, but is that what is wanted?
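The date rule suggested above could be sketched roughly as follows: use the recorded date stamp when the source has one, otherwise fall back to the person's earliest OBSERVATION_PERIOD_START_DATE. The staging table `source_race` and its `recorded_date` column are assumptions made for illustration. `observation_period` is reduced to the two columns the query needs, and `sqlite3` is used only to make the SQL runnable.

```python
import sqlite3


def race_observation_dates(race_rows, period_rows):
    """Return (person_id, race_concept_id, observation_date) rows.

    race_rows:   (person_id, race_concept_id, recorded_date or None)
    period_rows: (person_id, observation_period_start_date)
    """
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE source_race
            (person_id INT, race_concept_id INT, recorded_date TEXT);
        CREATE TABLE observation_period
            (person_id INT, observation_period_start_date TEXT);
    """)
    con.executemany("INSERT INTO source_race VALUES (?, ?, ?)", race_rows)
    con.executemany("INSERT INTO observation_period VALUES (?, ?)", period_rows)

    # Prefer the date the race was recorded; if it is missing, fall back
    # to the start of the person's first observation period.
    result = con.execute("""
        SELECT r.person_id,
               r.race_concept_id,
               COALESCE(r.recorded_date,
                        MIN(op.observation_period_start_date)) AS observation_date
        FROM source_race r
        LEFT JOIN observation_period op ON op.person_id = r.person_id
        GROUP BY r.person_id, r.race_concept_id, r.recorded_date
        ORDER BY r.person_id, r.race_concept_id
    """).fetchall()
    con.close()
    return result
```

With this fallback every row lands inside an observation period whenever the person has one, which avoids the YEAR_OF_BIRTH issue. A person with neither a recorded date nor an observation period would get a NULL date and would need separate handling.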

It sounds like there are 2 different scenarios mixed together here:

  1. An individual has multiple races (as described in the original question). The solution described here does “solve the problem” in that one assigns a “mixed race” concept that means the CDM can be used as is. Having said that, if I were to ask “give me all persons of African American race”, persons who were of “mixed race” (say African American and Asian) would not be selected from the person table. This is a very challenging problem to address in a robust manner (eg, in graphs), so it may just be a “punt” to the user to have to handle it.

  2. There are multiple events where a person’s race is recorded. In this situation, Erica’s suggestion about the date stamp of the recording of the race makes sense in that setting. This is opposed to the view where we are making assertions about the inferred state at conception also described here. It depends on the type of data sources it seems. In the AoU setting, we have different “tiers” of data sources, so I doubt we’ll use the “most recent” approach to resolve this. We will take the most recent of X>Y>Z, where X, Y, and Z are different data sources depending on what is available for that individual (or something along those lines).
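One way the “most recent of X>Y>Z” idea in scenario 2 might look is sketched below. It is not the AoU implementation. The source names, their tier order, and the `source_race` staging table are all hypothetical. For each person the query takes the highest-priority source that has any race recorded, then the most recent record within that source.

```python
import sqlite3

# Hypothetical tiers: a lower number means a more trusted source.
SOURCE_PRIORITY = {"survey": 1, "ehr": 2, "claims": 3}


def pick_race_by_tier(rows, priority=SOURCE_PRIORITY):
    """rows: (person_id, source, race_concept_id, recorded_date) tuples."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE source_race
            (person_id INT, source TEXT, race_concept_id INT, recorded_date TEXT);
        CREATE TABLE source_priority (source TEXT PRIMARY KEY, tier INT);
    """)
    con.executemany(
        "INSERT INTO source_priority VALUES (?, ?)", list(priority.items())
    )
    con.executemany("INSERT INTO source_race VALUES (?, ?, ?, ?)", rows)

    # Order each person's records by tier first, then newest first,
    # and keep only the top-ranked record.
    chosen = con.execute("""
        SELECT person_id, race_concept_id
        FROM (
            SELECT r.person_id, r.race_concept_id,
                   ROW_NUMBER() OVER (
                       PARTITION BY r.person_id
                       ORDER BY p.tier, r.recorded_date DESC
                   ) AS rn
            FROM source_race r
            JOIN source_priority p ON p.source = r.source
        )
        WHERE rn = 1
        ORDER BY person_id
    """).fetchall()
    con.close()
    return chosen
```

Note that an older record from a higher tier beats a newer record from a lower tier, which is the point of tiering. Records from sources missing in the priority table are dropped by the inner join.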

From my perspective, it would be nice if the Person table could support something more than just “Mixed Race” for these situations, but building out such a solution is complicated (does one create as many concepts as there are combinations?).


What is the use case? What are the questions you want to answer with these mixed races? Do you have a list?

It gets complicated very quickly as one starts thinking about genetics and other indicators. One might wish to study the prevalence of disease X in population Y. One could study social determinants of health in the context of racial differences. One could compare self-reported ancestry with genomics. It’s a very important feature in epidemiological research in the USA in general.

The answer may be “Sorry, this field won’t represent that information”. I wonder what use cases are met well by the current summary field in the person table; I suspect it’s simply an input data constraint that has caused it to be not necessary. It’s my understanding that a breadth of groups, eg the US census for 2020 and our local medical record, are changing the way they allow data collection to support the growing admixed populations. AoU doesn’t capture “ethnicity” separately from “race” now, which reflects the changing US cultural perspectives on these facets. In the same vein, one could imagine a single race/eth field that includes concepts for the combinations that leverage the hierarchies to be able to ask the same questions one could of the split fields.


If I understand you correctly the question is “are there genomic markers correlating with self-reported race?” Sounds good. Race doesn’t change during the course of the life (unless you run a 23andMe and find out what you really are, like my wife did), so you can collect all races mentioned:

select person_id, race_concept_id from person
union
select person_id, observation_concept_id from observation join concept on observation_concept_id = concept_id and domain_id = 'Race';

Solves the problem?

The data is certainly available as is. I’m suggesting more that the current Race (and Ethnicity) field(s) have a format designed with certain existing data sets/formats in mind and that newer data collection may be more robust. It’s an opportunity to rethink how the data is summarized to make the field more powerful for analysis in the context of this new data.

Urgh. I thought I cleared it, but now I am confused again.

What are “newer data collections”? How are they more robust? What do they look like?

Usually, if we want to jam more than one thing into a single field, we use pre-coordinated concepts. For example, “Tuberculosis of the lung” is “tuberculosis” and “lung disease” combined. Are you suggesting we create a hierarchy with “Black”, “White”, and “Mixed black and white”?