So, your point is we need more than one per person? Why can’t we have pre-coordinated concepts like in the other domains?
An observation is something non-static, something that is observed at a point in time. Ethnicity, race, sex, birthdate, death date: all of those are static demographic facts.
If I am a valid representation of that group of people who refuse to answer personal questions, said responder is going to do their own research and, should it disagree with what the provider told the person, they will ignore it.
In other words, if it is a study that requires that the patient has followed the instructions to a T, then perhaps said responder should be removed from the study. Of course, this would not be true for all studies.
I see. But observational studies usually are retrospective. So, them not wanting to say what they are means nothing with respect to the study, because it wasn’t even undertaken then. Also, the patients don’t get contacted.
As far as observational data are concerned, all we know is that we don’t know. Why we don’t know is irrelevant. And a fact that is not known will not be recorded, or, in demographics or other mandatory fields, will be concept_id=0.
I think this is where my past experience in writing engineering software causes me to see the world from a different viewpoint. In an engineering module, why one does not know something may cause a different heuristic to be used to interpolate/extrapolate the missing data.
I agree with you: if you need to impute missing data, anything you can hang your hat on would help. For the ETLer that may be something to think about. But for the analyst? That poor wretch has no chance to tweak individual patients. Phenotyping is hard enough the way it is, using the data at face value.
If you have a use case for using this type of information in an analytic, those flavors of null could be back in business.
We deal with longitudinal observational studies, utilizing both historic and contemporaneously obtained data spanning multiple years.
People occasionally change how they self-identify their race/ethnicity.
Potential cases. A person:
of African ancestry may decide that they prefer to be seen as “bi-racial” or “black”.
who, owing to family lore, claimed to be part Cherokee, finds out that it was not true.
adopted by a “white” family as an infant, might learn as an adult that one of their biological parents was half Navajo.
It could be of significance to a socio-medical researcher that for some part of the study period, the person believed themselves to have American Indian blood.
As race and ethnicity are self-reported social constructs, not “facts”, that vary widely around the world, I believe that we need to maximize our flexibility. How else would we be able to run a study with sites scattered between Australia, New Zealand, and the US?
While our theoretical partners in New Zealand would seem unlikely to have an interest in whether a subject is Hispanic, being able to tease out the Maori population could be critical. Likewise Aboriginal citizens in Australia.
By the way, what standard “Race” would you expect Australian Aborigines to check off on a survey instrument?
Though there are SNOMED codes for Australian Aborigines and New Zealand Māori, they are neither in the RACE vocabulary nor Standard concepts. How would you code an Australian asserting mixed heritage, one parent being Aboriginal, the other being of British (white) descent?
I disagree! In conducting research into quality of care, I would want to differentiate between subjects who refused to answer a question, didn’t know the answer, and selected an answer for which there is no standard concept. Similarly, it can be very important to distinguish a patient to whom a question was not posed or for whom a metric was not assessed.
I have been following this, as Gerry (@Pulver) put it, for a project I am the site PI for. First, I want to be sure we are aware that a large percentage of people in the USA are born as one race/ethnicity and die as another. These are dynamic constructs, like gender anymore, and people change their reporting of them over time. Not sure tracking this is critical, but the idea that they are static is not correct. So the argument about Observation vs. demographic isn’t quite right. Second, diversity, equity and inclusivity are huge issues these days to both our patients and the institutions I work with. Many institutions are very careful about how they capture these data. When you curate data for these institutions, they do not like to lose the granularity in their data. I just did a stratified sample for a research project where it was critical to stratify by the types of dual- and tri-racial people in the population. Third, remember these are meant to be self-reported data: we as providers or data managers are not supposed to be changing what is self-reported. That is the US Census model. While we are familiar with the 5-race and 2-ethnicity roll-up, even the US Census does sampling that includes detailed family information that is part of the ethnicity area if you happen to get the long sample. Finally, as @Pulver mentioned, as soon as you move outside the US these categories blow up. It seems an internationally acceptable approach should be sought. As noted above, if you are from Australia, have been told all your life that your race is Aborigine, which is a race in Australia, and come to the US, it would make perfect sense for you to answer “other”. Analytically you may roll them all up, but making data decisions to decrease granularity, instead of analytical decisions about when to roll things together, doesn’t seem to me to be the best overall approach.
But we have a system today. If you want to change it, we need:
The analytic use case. Having the data is not a use case. Surely there must be one, if the institutions go through the trouble and collect the information.
A proposal for that use case that delineates how the use case is fulfilled (and the current one does not do the job). And yes, it needs to work globally.
A debate to decide amongst the existing solution, your proposal and the other proposals.
Makes sense?
“Other” as opposed to which “one”? This does not work. We run a Closed World system. We know what we know (what’s in the record), and what is not in the record is not there. If you have an Australian Aborigine you declare it. There is no such research question asking “tell me the outcome XYZ on all patients with the race ‘other’!”. The answer, if there were one, would have little meaning since it would not be interpretable.
Of course, since it is a social more than a biological entity, in the US “Australian Aborigine” does not have the meaning and implication it has in Australia, and any outcomes research will not jibe between these two countries. One solution could be that we define ethnicity and racial value sets specific for a geography. But again, you propose, I am just facilitating.
Perhaps I would better understand your perspective if you explained, using the current schema, how you would populate person.race_concept_id and ethnicity_concept_id in a longitudinal database, using domain-consistent standard concepts, keeping in mind that, as you can’t foresee the questions which will be addressed by future users, you wish to minimize loss of intelligence.
Situations we may face on large-scale USA-based studies:
Bi-racial person identifies as Asian and African American?
Person identifies as Spanish?
Person identifies as Chamorra?
Person identifies as Chicano?
Person says “I don’t know”?
Person explicitly refuses/declines to answer?
Person writes in “Australian Aborigine”?
Person identifies as African American early in the study, but years later identifies as bi-racial?
Hypothetical cases that I expect our colleagues overseas may be likely to encounter:
Bi-racial person identifies as Australian Aborigine and English?
Person identifies as Spanish?
Person identifies as New Zealand Maori?
Person identifies as Zulu?
Person identifies as Han?
Would you code the five “overseas” scenarios differently if you were in Europe, Australia, Asia, or Africa, and if so, how?
[It could be helpful if folks from outside the US joined this discussion!]
You could have a concept “Asian-African American” in either race_concept_id or ethnicity_concept_id, depending where you think that should be.
You could have a concept “Spanish” in either race_concept_id or ethnicity_concept_id.
You could have a concept “Chamorra” in either race_concept_id or ethnicity_concept_id.
etc.
You have a record with both fields=0.
You have a record with both fields=0.
You have a concept “Australian Aborigine” in either …
I don’t understand that. We are not running prospective studies in OMOP. Whatever is in the current instance of the database is the fact.
You could have a concept “English-Australian Aborigine” in either…
etc.
But you still haven’t actually provided a use case. How would an “English-Australian Aborigine” be used in a query? Give me all persons who are “English”, or “Half English”, or “Quarter English”, or “Three Eighths English”? Is that a cohort you would study?
If this sounds sarcastic, it is not. I don’t know how to create scientifically valid questions that would give us evidence out of this mosaic information.
Concerning your apparent suggestion that OMOP is useful solely for looking at characteristics at a fixed point in time:
In a longitudinal study we may look at observations occurring prior to and during the period of data collection. Of course, by the time we receive incremental batches of data recorded during the study, they will be retrospective observations. As @Wilson_Pace explained, the demographic characteristics of a patient followed over a period of years can not be assumed to remain static.
You say that we could have all sorts of combination codes. However, as we presently do not have standard domain-compliant concepts for most of the world’s racial/ethnic identities, let alone combinations of them, this does not address my point that it can not be done today using the current US-centric scheme.
Considering your suggestion as an alternative to mine:
I would prefer to record a combined race/ethnicity using multiple codes from “n” concepts, rather than have on the order of 2^n concepts to cover all possible combinations of ethnicities/races. To be comprehensive, globally, “n” would be a multi-digit number, which, history suggests, will grow over time.
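To put rough numbers on that trade-off, here is a minimal sketch. It assumes a "combination" means any unordered, non-empty set of single identities, which gives 2^n - 1 possible pre-coordinated concepts; the function name is made up for illustration.

```python
def combination_concepts_needed(n):
    """Number of non-empty, unordered combinations of n single identities."""
    return 2 ** n - 1


# Compare the vocabulary size under the two approaches for a few values of n.
for n in (5, 10, 20):
    print(f"n={n}: {n} single concepts vs {combination_concepts_needed(n)} combination concepts")
```

Even for a modest n the pre-coordinated list grows far faster than the list of single concepts, which is the point being made above.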
Sure, most combinations may be extremely unlikely; would a committee be charged with selecting the “likely” combinations and promptly updating the list when an exception arises?
Once we accept that race/ethnicity is dynamic, I don’t want to deny future consumers of the data the ability to track changes over time.
P.S. While I look forward to your reply, I am going to step back from the discussion for now to see how others feel, particularly folks from outside the USA and others who deal with records of subjects whose ethnicity doesn’t neatly fall into the American Indian / Alaskan Native | African American/Black |Asian |Native Hawaiian / Pacific Islander / White & Hispanic structure.
I wasn’t going to reply with a solution, but then I was asked for my opinion.
If I were to design a solution, I would remove race and ethnicity from the Person table. I would create a convention to direct ETLers to put all races & ethnicities in the Observation table with observation.observation_concept_id = 3050381, “Race or ethnicity” (honestly doesn’t matter what the concept_id is as long as it means race/ethnicity) and all the individual races and ethnicities would be ETL’d into the observation.value_as_concept_id field. Then I would gather every race/ethnicity/cultural/skin identifier from every source I could find in the world, divide them into singular concepts if needed, de-duplicate them on exact/almost exact text string match and make them all standard concepts with domain_id = ‘Race/Ethnicity’. There would not be a hierarchy because it does not exist in nature in this era of time. The ETLer would have to split up any combos they have in their data and ETL each as a separate row. So, “black hispanic” becomes a row for “black” and a row for “hispanic”. The researchers can then group them however they want. This will require an initial push by the Vocab team, then it will be low maintenance after. Even better, it might be a good use case for the “community contribution” vocabulary project Anna Ostropolets unveiled. Then the community can continue to contribute as new races/ethnicities are discovered/unveiled.
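A minimal sketch of what that ETL step could look like, under stated assumptions: the split map and the value concept IDs below are hypothetical placeholders (no such standard concepts exist yet), and only a few Observation columns are shown. The 3050381 ID is the one quoted in the post above.

```python
from datetime import date

# Concept ID quoted in the post above for "Race or ethnicity".
RACE_OR_ETHNICITY_CONCEPT_ID = 3050381

# Hypothetical lookups for illustration only: the split map and the value
# concept IDs are placeholders, not real OMOP vocabulary content.
COMBO_SPLITS = {"black hispanic": ["black", "hispanic"]}
VALUE_CONCEPTS = {"black": 9000001, "hispanic": 9000002}


def to_observation_rows(person_id, source_value, observation_date):
    """Turn one source race/ethnicity string into one row per single term."""
    key = " ".join(source_value.lower().split())  # normalize case and spacing
    terms = COMBO_SPLITS.get(key, [key])          # split known combos only
    return [
        {
            "person_id": person_id,
            "observation_concept_id": RACE_OR_ETHNICITY_CONCEPT_ID,
            "observation_date": observation_date,
            # Terms without a mapping get 0, as discussed earlier in the thread.
            "value_as_concept_id": VALUE_CONCEPTS.get(term, 0),
            "observation_source_value": source_value,
        }
        for term in terms
    ]


rows = to_observation_rows(1, "Black Hispanic", date(2023, 8, 1))
for row in rows:
    print(row)
```

Splitting through an explicit map rather than on whitespace is a deliberate choice here: a term like "Native Hawaiian" is a single identity and should not be broken in two.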
As your proposal largely coincides with ours, apart from using “Race or Ethnicity” as observation_concept_id and shifting the various racial identifiers to value_as_concept_id, which I think is better, I am happy to support it.
I wonder though, how (or if) the current race_concept_id, ethnicity_concept_id, and related fields would be populated. Though I assume that they would be dropped from the CDM in a future major release, until then, I expect that the two conventions for handling race/ethnicity would need to be supported in parallel. Should the grandfatherly fields be left unpopulated (NULL), or should a flag value of some sort be inserted, to clue analysts to look to Observation for the race/ethnicity? Perhaps there is a convention for handling this predicament.
A flag would be needed. concept_id = 0 could work, but would also include the flavors of null. So, it’s not the best option. “Other” would work for those who identify as one of the non-standard or unlisted races or ethnicities, including multi-racial or multi-ethnic. However, I do realize “other” is considered a valid response to the “What is your race?” question when the responder is only given a handful of choices, as @Pulver pointed out:
How about a concept_id = “see all observation.value_as_concept_id values for this person where observation_concept_id = 3050381”? I’m joking, but only slightly. I need others’ input on this.
Correct, but we can’t denormalize the OMOP CDM. So, having a concept_id for “other” and then pointing folks to the Observation table is one option. Again, we need input on this.
Correct for parallel support. Possibly for dropping the race & ethnicity data from the Person table. If it’s not here, then look there is not ideal. We need to make the CDM analytic ready. But we also need to take into account the downstream consequences of moving a domain from a “static” table to a “clinical event” table. This is where the use cases really come into play. We need the questions you are asking of the data to correctly model the data. And as you pointed out, we need folks in other countries to chime in on this along with the researchers who have the questions. @Christian_Reich is taking notes on use cases and what is in your source data. Please post them here. We need to know so we can fully model these data and close this issue with a solution that will stand the test of time.
Re use cases: Christian, I added several use cases illustrating the need for multiple races to Jake’s proposed changes to race and ethnicity quite a while ago. There are many cases like these where there is a driving need to distinguish among people who self-identify as more than one race because they have distinct health outcomes.
Re what is in our source data: Jared can quantify it exactly, but we have a significant number of people we currently assign a “0” to because they report multiple races. That practice is currently creating confusion in our use of OMOP for reporting for projects that are building OMOP-shaped datasets. Workarounds can be found, but it’s a pain.
Re “Race is a construct”:
As I’ve noted elsewhere, many medical diagnoses don’t mean the exact same thing across cultures. Mental health diagnoses in particular, but others also have a different meaning and are ascertained using different methods by people with different training. Since those standards are not and cannot be applied to other OMOP concepts for things like conditions, it seems unwise, or at least inconsistent, to single out race/ethnicity as the set of concepts where those standards have to be met.
Why post-coordinate? Why wouldn’t a race as an observation be self-explanatory and go into observation_concept_id?
This hardly appears possible, because races have no definition, and neither have ethnicities. They are geographical/social/biological trait-based/ideological categories that people either place other people in, or put themselves in, or a little bit of both. There is no test that would determine whether a Black person in the US or in Nigeria or in the UK is the same race. Or even Black. There is no way to decide whether my Grandmother’s ethnicity was Austrian, Czechoslovakian, Czech, East German or German (she held all passports at some point). We cannot deduplicate, except those concepts that are clearly derived from the same source (e.g. the US OMB race categories, which, btw, have changed over the years). We cannot even distinguish races from ethnicities (think “Indian” in Singapore, which is thought of as a race there). That means we cannot create standard concepts (which must be deduplicated), unless we allow everybody’s concept as an independent declaration of some category. Which is totally fine with me.
Please bring them on. I haven’t heard much except some kind of regression between racial or ethnic categories and co-morbidities. @rimma was working on a project studying healthcare access.
That’s the one I am strongly supporting.
That is not correct. It is true that especially in mental disease, but also in cancer, there are different schools of categorization, and they often happen to be split along country or continent lines. But not because somehow different ethnicities go crazy in different ways, or have different malignancies. We are all the same creatures biologically, with very, very rare exceptions (a handful of mostly monogenic diseases). It’s just the way science develops. In fact, the vast majority of conditions are in full consensus internationally, and passing a final exam in Med School in one country enables you to do the same thing in another country hands down (but do it fast before you start forgetting this massive corpus of encyclopedic information). I know that.
I spoke with the Health Equity WG on Symposium Saturday outlining my proposal for race & ethnicity data in the CDM. The WG meeting wasn’t recorded, but the slides are posted here for anyone who is interested. I’d be happy to review this proposal with all interested persons.
Many reasons:
We do it for the “history of” concepts and it works very well for the ETLer and the end user. The data are easily inserted and retrieved. win-win
Stratifying by “Asian” when the term “Asian” is found in >1000 concepts might be tedious for end users.
Race, ethnicity, indigenous status and other terms used to describe, phenotypically or otherwise, a person’s recent or distant country of origin are ambiguous and poorly defined at the global level. These terms are continuously evolving as science evolves, as the vocabularies for these terms evolve, and as more source data are revealed at the global level. The pre-coordinated list of concepts would be extraordinarily long due to the lack of a global definition for race, ethnicity, indigenous status, etc. And the list of concepts will continue to grow over time, creating an unwieldy set of terms needing to be pre-coordinated.
Happy we are in agreement. We split combo terms, “Black African”, “Black Caribbean”, etc. into two separate terms, then ETL. And we de-dupe on exact or close to exact string match. Everyone contributes. Stanford’s “Native Hawaiian” and Colorado’s “natives hawaiian” are the same except for the typo and casing of the words. We keep “Native Hawaiian” and Colorado maps “natives hawaiian” to “Native Hawaiian”. We keep it super simple and when in doubt keep both.
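The de-dupe step described above could be sketched roughly like this. It is only an illustration: the 0.9 similarity threshold and the function names are assumptions, and the comparison uses Python's standard difflib rather than any agreed OHDSI tooling.

```python
from difflib import SequenceMatcher


def normalize(term):
    """Lower-case and collapse whitespace so casing differences do not matter."""
    return " ".join(term.lower().split())


def deduplicate(terms, threshold=0.9):
    """Keep the first spelling seen; map near-identical later spellings to it."""
    kept = []     # terms retained as candidate concepts
    mapping = {}  # every source term -> the retained term it maps to
    for term in terms:
        match = next(
            (k for k in kept
             if SequenceMatcher(None, normalize(term), normalize(k)).ratio() >= threshold),
            None,
        )
        if match is None:
            # No close match: when in doubt, keep both.
            kept.append(term)
            match = term
        mapping[term] = match
    return kept, mapping


kept, mapping = deduplicate(["Native Hawaiian", "natives hawaiian", "Black", "Caribbean"])
print(kept)
print(mapping["natives hawaiian"])
```

A high threshold errs on the side of keeping both terms, in line with "when in doubt keep both"; anything borderline would still need a human look.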
I reviewed the proposal.
It is not global enough. See Georgie’s use case here on August 1, 2023.
Leaving these data in the Person table precludes us from identifying the date of this observation and the provenance of these data. And this list will continue to expand.
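To illustrate what dated rows would buy us, here is a hedged sketch using an in-memory SQLite table with only a handful of Observation columns. The value concept IDs are made-up placeholders and identities_as_of is a hypothetical helper, not part of any OHDSI package; 3050381 is the concept ID quoted earlier in the thread.

```python
import sqlite3

RACE_OR_ETHNICITY_CONCEPT_ID = 3050381  # concept ID quoted earlier in the thread

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE observation (
           person_id INTEGER,
           observation_concept_id INTEGER,
           observation_date TEXT,          -- ISO dates compare correctly as text
           value_as_concept_id INTEGER)"""
)
# Person 1 reports one identity in 2015 and two identities in 2021.
# The 9000001 / 9000003 value IDs are placeholders, not real concepts.
conn.executemany(
    "INSERT INTO observation VALUES (?, ?, ?, ?)",
    [
        (1, RACE_OR_ETHNICITY_CONCEPT_ID, "2015-03-01", 9000001),
        (1, RACE_OR_ETHNICITY_CONCEPT_ID, "2021-06-15", 9000001),
        (1, RACE_OR_ETHNICITY_CONCEPT_ID, "2021-06-15", 9000003),
    ],
)


def identities_as_of(conn, person_id, as_of_date):
    """Return the value concepts from the latest report on or before as_of_date."""
    cur = conn.execute(
        """SELECT value_as_concept_id
             FROM observation
            WHERE person_id = ?
              AND observation_concept_id = ?
              AND observation_date = (
                    SELECT MAX(observation_date)
                      FROM observation
                     WHERE person_id = ?
                       AND observation_concept_id = ?
                       AND observation_date <= ?)
            ORDER BY value_as_concept_id""",
        (person_id, RACE_OR_ETHNICITY_CONCEPT_ID,
         person_id, RACE_OR_ETHNICITY_CONCEPT_ID, as_of_date),
    )
    return [row[0] for row in cur.fetchall()]


print(identities_as_of(conn, 1, "2018-01-01"))
print(identities_as_of(conn, 1, "2022-01-01"))
```

With the values in the Person table there is only one slot and no date, so neither of these two questions could be answered.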
At this time, the proposal doesn’t cover the following:
Flavors of NULL - with very few exceptions, flavors of null are not allowed in the CDM.
Negative values - statements of “it did not occur, it’s not part of the record, negative values, etc”. With a good use case, I can be persuaded to add “non-____” terms.
Hierarchies - Hierarchies are tough for these data. There isn’t a globally inclusive hierarchy at this time. Note - not including hierarchies in the Concept Ancestor table does not preclude users from using their own hierarchies.
Keep it simple and pragmatic for our first implementation. Let’s use the data in the new table with the new concept_ids, run it through some use cases and research, find areas for improvement, regroup, and make a plan for future iterations. This solution is scalable.
This is a HUGE change and will affect many in the OHDSI community, from the ETL through the pipeline to the researchers, including the following OHDSI groups: Vocabulary, Themis, CDM, DQD, Phenotype Library, Methods & others. This is an open source community; we need folks to volunteer to help with this effort or it will continue to be a topic in the forums, on GitHub and in the minds of everyone, but NOT an implementation in the CDM. Who’s joining @Christian_Reich to move this proposal forward?
Christian, I know this is the proposal you are championing. I was responding to the call to post for your notes the linked supporting use cases and our relevant source data issues.
Re our differences on whether the biomedical model is universal, we can agree to disagree. I think there are clearly culture-bound syndromes in mental health. I’m a bit surprised you don’t seem to think so, but it’s mostly a side issue in this thread. I brought it up here to help clarify whether temporal stability and international universality are applied as criteria to all other data representations, especially in light of points made about driving use cases in Australia.
Melanie, I’m happy to join Christian in moving this proposal forward if he’ll have me.