Race and Ethnicity in the OMOP CDM

I’d second @Patrick_Ryan’s assertion - we have some internal use cases as well.

As I remember arguing in an earlier thread, and as you state here, @Christian_Reich, just because race/ethnicity don’t have a solid, immutable biophysical grounding doesn’t mean that they don’t have a concrete impact on health outcomes. That impact may be driven by underlying molecular variation (e.g. well-documented differences in the phenotypic expression of diabetes between people with East Asian vs. European ancestry), sociocultural factors (e.g. higher incidence of hypertension in Black Americans), or some combination of the two.

That said, I share your frustration with the lack of standardization in how this is documented. In a recent JAMIA paper, our group described the results of a natural language processing pipeline designed to extract race and ethnicity from free-text notes. In short, we not only found that many patients from underserved populations didn’t have structured documentation of race/ethnicity, but also that patients from these populations who lacked structured data differed significantly from those who had it! In putting together this work, we relied heavily on the OMB definitions you mention above, but a lot of the free-text documentation looked very different from those definitions - as you mention, people tend to conflate race and ethnicity, and a lot of Hispanic people don’t seem to agree with the government’s idea that you can be Black Hispanic, White Hispanic, etc.

I continue to go back to the idea of fit-for-purpose. Just as different vocabularies may be in use in different countries, the distribution of race/ethnicity codes is going to differ between countries. For the US, OMB definitions are not perfect, but I think it’s best to just keep using them for US data sets and let other countries do whatever’s most appropriate for them. Even if we end up with 0 for race_concept_id and Hispanic for ethnicity_concept_id, that still lets us ask questions like “are Hispanic patients at higher risk of X?” Folks in other countries where, say, Sami is a more pertinent ethnicity than Hispanic can map how they see fit, and any analyses that depend on international data will have to take the differing mappings into account. This also, to me, ties in with the idea that the impact of race/ethnicity on health outcomes is going to be specific to the sociocultural context of race/ethnicity - the impact of being Black in America on your health is going to be very different from the impact of being Black in Egypt or Brazil. I don’t think there’s any trickery we can do with vocab or mapping that will get around the local peculiarities of the relationship between race/ethnicity and health.
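
To make the point about 0 for race_concept_id alongside a Hispanic ethnicity_concept_id concrete, here is a minimal Python sketch. It is illustrative only: the rows are made-up toy data, `has_outcome` is a hypothetical flag standing in for "had outcome X" (it is not a CDM column), and the concept IDs 38003563 ("Hispanic or Latino") and 38003564 ("Not Hispanic or Latino") are the values I believe the standard vocabulary uses, so please verify them in Athena before relying on them.

```python
from collections import Counter

# Assumed standard ethnicity concept IDs (verify against your vocabulary release).
HISPANIC = 38003563
NOT_HISPANIC = 38003564
UNMAPPED = 0  # concept_id 0 = no matching standard concept

# Toy person rows; has_outcome is a hypothetical flag, not part of the CDM.
persons = [
    {"person_id": 1, "race_concept_id": UNMAPPED, "ethnicity_concept_id": HISPANIC, "has_outcome": True},
    {"person_id": 2, "race_concept_id": UNMAPPED, "ethnicity_concept_id": HISPANIC, "has_outcome": False},
    {"person_id": 3, "race_concept_id": UNMAPPED, "ethnicity_concept_id": NOT_HISPANIC, "has_outcome": True},
    {"person_id": 4, "race_concept_id": UNMAPPED, "ethnicity_concept_id": NOT_HISPANIC, "has_outcome": False},
    {"person_id": 5, "race_concept_id": UNMAPPED, "ethnicity_concept_id": NOT_HISPANIC, "has_outcome": False},
    {"person_id": 6, "race_concept_id": UNMAPPED, "ethnicity_concept_id": NOT_HISPANIC, "has_outcome": False},
]


def outcome_rate_by_ethnicity(rows):
    """Return the crude outcome proportion per ethnicity_concept_id.

    race_concept_id is never consulted, so rows with race mapped to 0
    still contribute to the ethnicity comparison.
    """
    totals = Counter()
    events = Counter()
    for row in rows:
        eth = row["ethnicity_concept_id"]
        totals[eth] += 1
        if row["has_outcome"]:
            events[eth] += 1
    return {eth: events[eth] / totals[eth] for eth in totals}


rates = outcome_rate_by_ethnicity(persons)
print(rates)
```

The comparison only touches ethnicity_concept_id, which is why an unmapped race doesn't block the "are Hispanic patients at higher risk of X?" question. A real analysis would of course pull outcomes from the clinical tables and adjust for confounders rather than compare crude proportions.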