Race and Ethnicity in the OMOP CDM

Some thoughts:

  • Is “ethnicity … a proxy for wealth, education, or other factors”?

Maybe, but probably not in a very interesting way, and probably not as good a proxy as insurance status or neighborhood-level variables, which, for US patients anyway, we can get by geocoding addresses. Plus, we’d probably be more interested in the role it plays after you control for wealth, education, etc. anyway.

  • Do “patients from underserved populations … have no structured documentation of race/ethnicity”?

Obviously it’s not that they have no documentation at all, but I suspect they’re less likely to have it (and, more broadly, that data quality issues disproportionately affect these communities). We’d need a gold standard to be sure; that’s probably the next line of research after that JAMIA paper. I can also check whether our NLP pipeline “filled in the blanks” disproportionately as Black/Hispanic for the NULLs/DECLINEDs/OTHERs. Still, this is a separate line of research, and one that seems to me only tangentially relevant to the question at hand: how do we handle the mapping of race/ethnicity for the OHDSI community?
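A rough sketch of what that check could look like, assuming each patient comes out as a (structured race, NLP-derived race) pair. The function name, the record shape, and the set of "missing" labels are all hypothetical, not anything from an actual pipeline:

```python
from collections import Counter

# Assumed labels for "no usable structured value"; adjust to the source data.
MISSING = {None, "NULL", "DECLINED", "OTHER"}

def nlp_fill_in_shares(records):
    """Share of each NLP-derived race among patients whose structured race is missing.

    records: iterable of (structured_race, nlp_race) pairs (hypothetical shape).
    Patients for whom NLP found nothing either are left out of the denominator.
    """
    counts = Counter(
        nlp_race
        for structured_race, nlp_race in records
        if structured_race in MISSING and nlp_race is not None
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {race: n / total for race, n in counts.items()}

# Toy data only, to show the call.
example = [
    (None, "Black"),
    ("DECLINED", "Black"),
    ("OTHER", "White"),
    ("White", "White"),   # structured value present, so ignored here
    ("NULL", None),       # NLP found nothing, so ignored here
]
shares = nlp_fill_in_shares(example)
```

These shares would then need to be compared against the distribution among patients who do have structured data before calling anything disproportionate.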

  • Do “patients from these populations who lacked structured data differ significantly from those who had it”?

At least at our institution, they definitely do; see the reference above.

  • Is the “impact of being Black in America on your health … very different than the impact of being Black in Egypt or Brazil”?

This is the kind of question that militates for inclusion and good mapping of race/ethnicity in OHDSI! If we’re using the same terminology for these concepts across all of our databases, we can ask and answer questions like this.

And from all I can see you are saying the current OMB thing works, right? Do we not have a problem? Because again, without a problem there is no need to work on a solution.

I think I was saying that it works for US data sets, but that ultimately the way we do it is going to depend on the specific question we’re asking and the peculiarities of the source data. I know that the whole point of OHDSI is that we all pick one way to do it (the way that lets us ask the greatest number of questions), but I think this is an area where we’re never going to get it 100% right. I think the best we can do is use OMB mapping for US data sets, try to coerce non-US data sets to the OMB mapping if we want to ask questions about race across international borders (like the Brazil thread from a while ago), and otherwise adhere to the fit-for-purpose standard.
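For the US case, the OMB mapping amounts to a small lookup in the ETL. A minimal sketch, assuming the source values are free-text labels; the function and dictionary names are made up for illustration, and the concept IDs are the standard OMOP race concepts as I recall them, so check them against your own vocabulary tables before relying on them:

```python
# OMB race categories -> OMOP standard concept IDs (verify against your vocabulary).
OMB_RACE_CONCEPTS = {
    "white": 8527,
    "black or african american": 8516,
    "asian": 8515,
    "american indian or alaska native": 8657,
    "native hawaiian or other pacific islander": 8557,
}

# OMOP convention: concept_id 0 means "no matching concept".
UNMAPPED = 0

def map_race(source_value):
    """Return a race_concept_id for a source label, or 0 when nothing matches."""
    if source_value is None:
        return UNMAPPED
    key = source_value.strip().lower()
    return OMB_RACE_CONCEPTS.get(key, UNMAPPED)
```

Anything that does not fit the five OMB categories falls through to 0 here, which is exactly where the fit-for-purpose judgment comes in for non-US sources.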

So basically, I’m saying that I don’t think we have a problem. Patrick and Fernanda were able to publish their paper, and we have internal use cases for using the data in its current format to assess disparities in quality of care, etc. I can see how it wouldn’t work for source data from other countries, but until we have compelling use cases from those countries, why do anything? And if/when we do, I think you hit the nail on the head in your initial post:

keep both ETHNICITY_CONCEPT_ID and RACE_CONCEPT_ID, but we split off the lower half of the Race/Ethnicity Concept hierarchy to form a new Ethnicity Domain (including the hispanics). In that world, any ethnicity can be combined with any race.

This way people with Swedish data sets can map Sami as an ethnicity, and people with US data sets can map Hispanic/non-Hispanic.
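To make the "any ethnicity with any race" idea concrete, here is a small sketch. It uses plain labels rather than concept IDs because the separate Ethnicity Domain is only a proposal; the class, function, and both value sets are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative value sets only; a real Ethnicity Domain would come from the vocabulary.
RACES = {
    "White",
    "Black or African American",
    "Asian",
    "American Indian or Alaska Native",
    "Native Hawaiian or Other Pacific Islander",
}
ETHNICITIES = {"Hispanic or Latino", "Not Hispanic or Latino", "Sami"}

@dataclass(frozen=True)
class Demographics:
    race: Optional[str]
    ethnicity: Optional[str]

def make_demographics(race, ethnicity):
    """Validate race and ethnicity independently; no pairing is disallowed."""
    if race is not None and race not in RACES:
        raise ValueError(f"unknown race: {race!r}")
    if ethnicity is not None and ethnicity not in ETHNICITIES:
        raise ValueError(f"unknown ethnicity: {ethnicity!r}")
    return Demographics(race, ethnicity)

swedish_patient = make_demographics("White", "Sami")
us_patient = make_demographics("Black or African American", "Hispanic or Latino")
```

The point of validating the two fields separately is that adding a new ethnicity never requires touching the race side.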