Friends:
Recently, the debate about Ethnicities and Races flared up again in the CDM Working Group: is the current model adequate, or does it need changing? And it keeps coming back, again and again.
I am happy to admit that I personally have a strong aversion to all the tribalism and identity assignment of people, and to the fact that we are spending so much time on something that has not yielded a single use case for OHDSI so far. Yes, it is easy: everybody believes he or she “understands” race, and therefore we can debate it endlessly. Gender/Sex debates are equally popular and equally fruitless from the perspective of generating evidence that promotes better health decisions and better care, as our mission requires.
But let me set my negative feelings aside. Let’s discuss what we really might need, and how it should be represented.
The current ETHNICITY_CONCEPT_ID and RACE_CONCEPT_ID clearly do not work well. They follow US Office of Management and Budget guidelines for how the US Census should deal with race and ethnicity. The whole Ethnicity of “Hispanic” or “Not Hispanic” that we have now exists because in one of the censuses the third largest race was “Other”, and it was picked mostly by Hispanics. So, a separate entity was created to deal with that. For us, I would argue, it does not make sense and, more importantly, it cannot work anywhere else in the world. In Korea, splitting ethnicities into “Hispanic” and “Not Hispanic” is ridiculous.
The question is: could there be a clean solution? Currently, race and ethnicity definitions are based on the assumption that people can be meaningfully clustered by similarities in genetic, phenotypic, geographical, social and/or religious attributes. Which one? Whatever works, it seems:
- Genetic: Obviously, our genes get inherited according to clear rules, and we have haplotypes (genes on the same chromosome that cannot easily be inherited independently from each other). But as we know from the Human Genome Project, our alleles have a cosmopolitan distribution, like that of marine fish and birds. The distribution is not equal, but you find all alleles in all currently defined races. And on average, the difference between any two people within a race is larger than the difference between the averages of the races. The best illustration of that is that the White Americans Craig Venter and James Watson are more different from each other than either of them is from the Korean geneticist Kim Seong-Jin; all three were among the first people to have their genomes fully sequenced.
- Phenotypic: I am not even starting. None of the attributes we think we recognize so easily has a common foundation or allows an unambiguous race distinction. Just one example: there are dark-skinned people in South Asia and among various native races. And the thing breaks down when people mix. Obama is black? Why? He is just as white.
- Geographical co-location: Yes, in the pre-industrial world people couldn’t easily move around, and current ethnicities are defined through ancestry from a geographical region. But many people have no clue about the geographic location “they are from”, as the ancestry service commercials nicely show. We discussed the concept of the “French Canadian”, a geographical origin that arguably was established rather recently through mass migration. From where? All over the place.
- Cultural and religious: Don’t get me started. That stuff is completely elusive and only works locally. Apart from their language, what distinguishes the “Hispanics” in the Americas culturally from the people around them is very different from what distinguishes those in Europe. And obviously, these attributes can easily be changed. I can change my religion. Hey, I actually did, a couple of times (on paper, happy to tell you what happened over a beer). I am sure my outcomes are still the same.
People argue that, even if we cannot define them properly, self-assigned race and ethnicity correlate strongly with disease progression and treatment outcomes, as studies show all the time. So, something must be going on. I agree. But it is a mixture of all of the above, which drives social and economic disparities, and often these features are a function not of the individual but of the society around them. Except we have no way of capturing this globally.
Another argument I heard was that maybe we should accept the sub-optimal definition of the thing, since other Domains also have inaccurate concepts (e.g. syndromes in Condition). Far from it. In Condition, we may have some blurring at the edges, but most of the Conditions are pretty clean, and they work all over the world. Race and ethnicity, in contrast, are never clean. Not one of them.
Bottom line: There are personal or societal features of individuals that allow for some clustering. But these clusters have no definable cut-off lines, and they are only reproducible locally (if that) and over historically short periods of time. In other words: We are all just Kenyans who did some hanky panky with the Neanderthals.
Here are three pragmatic proposals for how to deal with this. Each removes the obvious flaws of the current system while allowing the arsenal of OHDSI methods (including preference scoring) to continue working.
- We get rid of ETHNICITY_CONCEPT_ID and continue only with RACE_CONCEPT_ID, using the mixed Race-Ethnicity Concept hierarchy we currently have (but including the Hispanics).
- We keep both ETHNICITY_CONCEPT_ID and RACE_CONCEPT_ID, but we split off the lower half of the Race/Ethnicity Concept hierarchy to form a new Ethnicity Domain (including the Hispanics). In that world, any ethnicity can be combined with any race.
- We drop the Race and Ethnicity Domain Concepts until such time that we can properly model biological and socio-economic attributes of people on a global level. Till then, folks can use their own >2B local concepts.
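To make the difference between the three options a bit more tangible, here is a rough Python sketch of what each would do to a single person record. It is only an illustration under assumptions: the concept IDs are made-up placeholders rather than real vocabulary entries, the `Person` class and the `option_*` functions are hypothetical names, and the rule in option 1 that a recorded ethnicity takes over the race slot is just one possible reading of that proposal.

```python
from dataclasses import dataclass
from typing import Optional

# All concept IDs below are made-up placeholders, not real vocabulary entries.
WHITE = 101
HISPANIC = 201
NO_CONCEPT = 0                     # assumed "no matching concept" value
LOCAL_CONCEPT_MIN = 2_000_000_000  # assumed boundary for ">2B" local concepts


@dataclass(frozen=True)
class Person:
    race_concept_id: int
    ethnicity_concept_id: Optional[int]  # None means the field no longer exists


def is_local(concept_id: Optional[int]) -> bool:
    """True if the ID falls in the assumed local (>2B) range."""
    return concept_id is not None and concept_id > LOCAL_CONCEPT_MIN


def option_1(p: Person) -> Person:
    """Single field only. Assumption: a recorded ethnicity replaces the race value."""
    ethnicity = p.ethnicity_concept_id
    merged = ethnicity if ethnicity not in (None, NO_CONCEPT) else p.race_concept_id
    return Person(merged, None)


def option_2(p: Person) -> Person:
    """Both fields stay and combine freely; only the vocabulary behind them changes."""
    return p


def option_3(p: Person) -> Person:
    """Standard concepts are dropped; only local concepts survive."""
    def keep(concept_id: Optional[int]) -> int:
        return concept_id if is_local(concept_id) else NO_CONCEPT
    return Person(keep(p.race_concept_id), keep(p.ethnicity_concept_id))


if __name__ == "__main__":
    person = Person(WHITE, HISPANIC)
    for option in (option_1, option_2, option_3):
        print(option.__name__, option(person))
```

Note that option 2 leaves the record itself untouched, since the change would happen in the Concept hierarchy, while option 3 keeps information only for sites that maintain their own local concepts.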
Sorry it got so long. Any thoughts?