
Dealing with multiple races and other exceptions

Mostly :slight_smile:

You forgot to mention my solution allows for provenance of the records via the observation_type_concept_id field.

And now is a good time to mention my implementation plan, found on slides 8 - 12.

Before we introduce breaking changes to the CDM and remove the race & ethnicity concept_ids from the Person table, I suggest we adopt a convention that encourages and allows the use of observation_concept_id = “Has race/ethnicity” in the Observation table, so that these data can co-exist in both tables until the next major (breaking) CDM release. Yes, this will denormalize the CDM; however, it will give us some time to test drive this solution and update cohort definitions before going all in with removal of these data from the Person table. I spoke to @Chris_Knoll at the Symposium and he doesn’t have any concerns about this change for Atlas. Chris suggested I talk to @schuemie, so I pitched it to him. Cohort definitions will have to be updated. Clear and concise documentation on how to ETL the data and how to use the data will be given by Themis & the CDM WG.

Once ETLers have implemented this change, we will need feedback from them on: 1. Were the instructions on ETLing these data clear? 2. What are your pre & post change mapping rates? 3. What’s still not mapping? Next, we’re going to need feedback in the same vein from the analysts: 1. Which use cases now work? 2. Which use cases don’t work? 3. What’s missing?

I am coming at this from the Themis point of view with a strong Health System Interest Group influence. I’d like all of us to keep Themis’ mission statement in mind as we discuss this topic: “Themis makes decisions for the good of the whole community. We must compromise. We can always revisit and modify the convention. Don’t let perfect be the enemy of great. And interoperability between different OMOP CDMs is great!” I’ll admit, it’s a little cheesy, but we really need the community to follow the standards. You can always add additional fields to your CDM, but you need to populate the CDM as expected or we can’t do federated research. And we must compromise, agree to disagree, and move forward. The race topic has been going in circles and infinite loops for years.

With this in mind, I propose we defer the flavors of NULL (unknown, not answered, etc.), hierarchies, and negative values to a future iteration unless there is a strong use case. These items will be easy to add in later, if needed. Let’s use the data with the new concept_ids, run it through some use cases and research, identify areas needing improvement, regroup after running it through the rounds, and then make a plan. Let’s keep it simple and pragmatic for our first implementation.

To echo what @aostropolets said, regardless of which proposal or combination of proposals the community adopts, we need to broadcast to all including: OHDSI chapter leads/WGs, those about to ETL their data, those using the CDM including secondary research groups N3C, All of Us, etc.

Since this is such a huge change and will affect many in the OHDSI community from the ETL through the pipeline to the researchers and the tools used, once a decision has been made, I suggest we form a sub-working group to document and implement the change requested by the community.


I want to voice my support for Melanie’s proposal.

Keeping race and ethnicity only on Person seems to cause problems with interpretation:

  1. The ETL’er will need to decide which race to favor (first recorded in our period, most recent, most frequent?)
  2. The analyst looking at OMOP needs to know what convention was used.

If the research focus is on race, not singling out one of many entries gives more flexibility in the methodology used.

Adopting the proposal will lessen the burden of custom mapping race combinations.
The impact on tools is an important consideration and needs to be addressed.


At the risk of angering Christian (I probably will not be able to attend the workgroup), if we are going to have to track demographics, then make a demographics table. It isn’t that hard and would run faster than trying to pull the data out of observations.
For those of us that are using certain EHRs, all that moving it to observations does is make the ETL much harder with no gain in functionality. We have ZERO history of demographics of any sort. I am sure that the billing dept. does, but that data is not accessible to us. Kill and fill means that we always lose any demographics that have changed.


@Mark:

No anger! This is the good debate we are having.

We have a demographics table; it’s called PERSON. It does have the necessary fields, but they lack timing. Sounds to me like you are a proponent of Jake’s proposal. Make sure you come to the WG session.

:slightly_smiling_face: I know.

As a compromise, yes.

Hi Friends,

We will meet to ratify the proposals today, December 7th, 5pm EST in the Vocab WG subchannel in Teams. The invite went to the members of this thread as well as all members of the CDM WG. Looking forward to coming to a decision! Recording will be posted after the meeting as well for those who can’t make it.

I attempted to join the WG… but it being M$ Teams, of course it crashed my system.

I cannot join a meeting (off site) via Teams, and the notes are in otter.ai, which is not an approved app (for security reasons) that I can use. Can someone either port the meeting notes to a text file or at least give a TL;DR?

Thanks

Absolutely. You can also watch the recording here.

As per the majority voting, the proposal we would like put forward is as follows:

  • Keep one race and ethnicity (if present) in race_concept_id and ethnicity_concept_id in PERSON table
  • Keep all additional races and ethnicities and/or any longitudinal changes in OBSERVATION table. Provenance of the record can be captured through observation_type_concept_id.
  • We do not deduplicate races to allow greater flexibility given lack of consensus in terms. You can add other races and ethnicities to the OHDSI Vocabularies as long as they are not full duplicates of existing races.
  • No flavors of NULL are permitted, as usual. If race is unknown or not reported it is 0.
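A minimal sketch of how an ETL might apply the first two bullets, under stated assumptions: `HAS_RACE_CONCEPT_ID` and the example race and type concept IDs are hypothetical placeholders (not real vocabulary IDs), and taking the first mapped race as the PERSON value is only a stand-in, since the rule for choosing it is not settled in this thread.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Tuple

# Hypothetical placeholder, NOT a real concept ID; look up the real
# "Has race/ethnicity" style concept in the OHDSI Vocabularies.
HAS_RACE_CONCEPT_ID = 9000001


@dataclass
class ObservationRow:
    person_id: int
    observation_concept_id: int
    value_as_concept_id: int
    observation_date: date
    observation_type_concept_id: int  # carries provenance of the record


def split_races(
    person_id: int,
    mapped_race_concept_ids: Iterable[int],
    observation_date: date,
    type_concept_id: int,
) -> Tuple[int, List[ObservationRow]]:
    """Return (race_concept_id for PERSON, OBSERVATION rows for the other races)."""
    # Drop unmapped values (0) and exact repeats, keeping source order.
    races = list(dict.fromkeys(c for c in mapped_race_concept_ids if c != 0))
    if not races:
        # Unknown or not reported: PERSON gets 0, nothing goes to OBSERVATION.
        return 0, []
    # Assumption: the first mapped race is the PERSON value (placeholder rule).
    primary, *additional = races
    rows = [
        ObservationRow(person_id, HAS_RACE_CONCEPT_ID, c, observation_date, type_concept_id)
        for c in additional
    ]
    return primary, rows
```

For example, `split_races(1, [9000101, 9000102], date(2020, 1, 1), 9000002)` would keep `9000101` for PERSON and emit one OBSERVATION row carrying `9000102`.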

There are two implications of this proposal for network studies I can think of:

  1. If race/ethnicity is an inclusion criterion, one will have to look in the OBSERVATION table as well
  2. FeatureExtraction relies on the PERSON table fields to compute corresponding proportions for Table 1. If some entries are in OBSERVATION table, these proportions may be imprecise.
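To make the first implication concrete, here is a rough sketch of an inclusion query that checks both tables. It uses an in-memory SQLite database with cut-down PERSON and OBSERVATION tables; the concept IDs (`HAS_RACE_CONCEPT_ID`, `RACE_A`, `RACE_B`) and the helper names are hypothetical placeholders, not real vocabulary IDs or OHDSI tooling.

```python
import sqlite3

# Hypothetical placeholder IDs for illustration only.
HAS_RACE_CONCEPT_ID = 9000001
RACE_A, RACE_B = 9000101, 9000102


def persons_with_race(conn, race_concept_id, has_race_concept_id=HAS_RACE_CONCEPT_ID):
    """Return sorted person_ids that have the race in PERSON or in OBSERVATION."""
    sql = """
        SELECT person_id FROM person WHERE race_concept_id = :race
        UNION
        SELECT person_id FROM observation
        WHERE observation_concept_id = :has_race
          AND value_as_concept_id = :race
        ORDER BY person_id
    """
    params = {"race": race_concept_id, "has_race": has_race_concept_id}
    return [row[0] for row in conn.execute(sql, params)]


def demo_connection():
    """Build a tiny in-memory database: person 2 has RACE_B in PERSON and RACE_A in OBSERVATION."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE person (person_id INTEGER PRIMARY KEY, race_concept_id INTEGER);
        CREATE TABLE observation (
            observation_id INTEGER PRIMARY KEY,
            person_id INTEGER,
            observation_concept_id INTEGER,
            value_as_concept_id INTEGER
        );
    """)
    conn.executemany(
        "INSERT INTO person VALUES (?, ?)", [(1, RACE_A), (2, RACE_B), (3, 0)]
    )
    conn.execute(
        "INSERT INTO observation VALUES (1, 2, ?, ?)", (HAS_RACE_CONCEPT_ID, RACE_A)
    )
    return conn
```

In this toy data a PERSON-only filter on `RACE_A` would find person 1, while `persons_with_race` also picks up person 2. That gap is the same one that can make PERSON-based proportions imprecise.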

Now we need further feedback from the community.
Please let the Vocab WG know if:

  • You have research that is impossible to carry out with this model,
  • You have/know of tools, queries, scripts or anything else that will not work with this approach,
  • You have other concerns

We need a THEMIS rule on how to select the race that goes into the PERSON table in case of multiple races.
I heard two approaches:

  1. Melanie proposed to simply put ‘Multiracial’ there.
    I liked this approach, but it requires vocab team involvement.
    Is there consensus on it and is it going to be added to the vocabs as a new concept?

  2. Put the latest race - I guess the latest that maps to standard.
    With the most recent value, the race will change more often than if we took the mode, and accidental errors will be more pronounced. On the flip side, it is easier to implement and it captures the latest trends. Somebody noted that this approach is consistent with other rules for person fields (the latest value is assumed to be an intentional correction).
    Is there consensus on ‘the latest race’ approach or is it still in the discussion/approval stage?
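To compare the ‘latest’ and ‘mode’ options side by side, here is a small sketch. It assumes each source record is a `(date, race_concept_id)` pair with 0 meaning unmapped; the function names and the tie-break for the mode (most recently seen wins) are my own assumptions, not a ratified rule.

```python
from collections import Counter
from datetime import date
from typing import List, Tuple

RaceRecord = Tuple[date, int]  # (date recorded, race_concept_id; 0 = unmapped)


def latest_race(records: List[RaceRecord]) -> int:
    """Option 2: the most recently recorded race that maps to a standard concept."""
    mapped = [(d, c) for d, c in records if c != 0]
    if not mapped:
        return 0
    return max(mapped, key=lambda rec: rec[0])[1]


def most_frequent_race(records: List[RaceRecord]) -> int:
    """Alternative: the mode of the mapped races; ties go to the most recently seen."""
    mapped = [(d, c) for d, c in records if c != 0]
    if not mapped:
        return 0
    counts = Counter(c for _, c in mapped)
    last_seen = {}
    for d, c in mapped:
        last_seen[c] = max(d, last_seen.get(c, d))
    return max(counts, key=lambda c: (counts[c], last_seen[c]))
```

With two older records of one race and a single newer record of another, `latest_race` returns the newer value while `most_frequent_race` keeps the older one, which is exactly the trade-off described above.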

At least we do not have to worry about messing with the demographics table as all the non primaries will go into the junk drawer, as someone calls it.

We only have demographics info, so we will be doing neither, but we do have an orthogonal listing of what the patient declares is the race/eth of said patient, so we know what is to be put in the primary (person) listing. My question is, when do we date the other listings?
What we have is demographics data, there is no date associated with it. As far as our EHR is concerned, all demographics have no associated date but internally are used with the entry date or birth date; that is totally unrealistic for longitudinal data (which is one reason I fought so hard against what was done).

Isn’t that the rule for Observations? The date is the date of the observation record?

Sorry, I meant the entry date: the time the demographics entry was created, which is usually the patient’s first encounter in our system. We have no history for any demographics field; we don’t know when a field was entered or modified, just when the initial record was created and the last modification timestamp of the record, and again, we don’t know what was modified. Many times it is a background process that triggers the modify timestamp: useless information for studies but valid information for internal reporting.

Demographics is a stateless table for us. I will put a fake timestamp on it, just tell me what fake timestamp is wanted.

Edit: from the CDM
Depending on the structure of the source data, this may have to be determined based on dates. If an OBSERVATION_DATE occurs within the start and end date of a Visit it is a valid ETL choice to choose the VISIT_OCCURRENCE_ID from the visit that subsumes it, even if not explicitly stated in the data. While not required, an attempt should be made to locate the VISIT_OCCURRENCE_ID of the observation record. If an observation is related to a visit explicitly in the source data, it is possible that the result date of the Observation falls outside of the bounds of the Visit dates.

I can choose:
A. the very first visit date,
B. the very last visit date
C. The birth date of the person
D. 1900-01-01

I like D as it would allow tools to know that there is no match in system to keep from creating false positive linkage.

This is a flavor of null, which I think we should avoid.

It is not a flavor of null; null tells one nothing.
I can just leave this out of my mapping; that is much better than putting false data into the system.

Mark:

If you don’t have a time stamp for this type of information, which is probably going to be very common, I would put in A, or even better, the observation_period_start_date where this thing belongs. Is that date useful? No, but that’s the nature of the problem. For many people, race and ethnicity is a static piece of information. I personally don’t expect to see any changes in the rest of my life. It is the flip side of supporting use cases where people find it dynamic.
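Sketched as code, the fallback could look like this. The function name and argument order are hypothetical; the preference order (source date, then observation_period_start_date, then first visit date) is just my reading of the suggestion, and returning `None` leaves the “skip the record” choice to the ETL.

```python
from datetime import date
from typing import Optional


def demographic_observation_date(
    source_date: Optional[date],
    observation_period_start_date: Optional[date],
    first_visit_date: Optional[date],
) -> Optional[date]:
    """Pick an observation_date for a race/ethnicity record that may lack one."""
    # Prefer a real source date, then the start of the observation period,
    # then the first visit (option A); None means no usable date was found.
    for candidate in (source_date, observation_period_start_date, first_visit_date):
        if candidate is not None:
            return candidate
    return None
```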

I hate to re-hash old discussions, but this does strike me the same way as how ‘history of’ information is put into the CDM. Some people say to put the ‘history of’ observations at the start of the observation period. My personal opinion is that the CDM maintains a record of information about when things were observed, not when they actually happened. For some things the two dates are very close (like drug exposures and procedures), but for other things they are not. My question is: is the CDM supposed to make a best guess about when the thing actually happened, or just record the observations as they are known?

I would lean to the latter, because medical decision making may be done based on the information available at the time. You might have someone who had a latent disease for a very long time that was only discovered on a certain date; I would argue that the date it was observed is when different medical decisions may be made, so we should reflect that in the model. On the other hand, I can understand the argument that you want the actual date of existence to get more accurate results. But I’m not sure we can always determine the actual date, whereas we can definitely get the date when it was observed, so recording the observed date seems like the more consistent way to go.

So, apologies for derailing the convo on this, but it feels like the same recurring problem: what are we trying to represent, and can we represent all information with the same fidelity (actual vs. observed in this case)? In the case of other demographics, it seems there’s a split between things that can be observed over time (ethnicity) and things that are fixed for all time (your birth date). I was a little disappointed that race and ethnicity were lumped into the same bucket, because I feel like one is considered fixed while the other is possibly considered a social context that can change over time, so we are looking at 2 different solutions for 2 different sorts of data problems. I respect that the time for that debate is over, but the theme of this type of challenge remains.


@Chris_Knoll:

I think you are spot-on with this. But the question is not whether to change the observation_date from when it was observed to some arbitrary date, but what date we put in if the source data has no date at all for some fact, which will very often be the case with race and ethnicity data. In that case you could have some heuristic for when that information was probably collected. In the case of claims, that is probably at the beginning of enrollment, which is the beginning of the Observation Period.

I think (may be mistaken) that the next step with conventions is to pass it over to the Themis WG for discussion/ratification. @MPhilofsky could you please advise on the process?


Yes, @aostropolets, Themis can give input. I am a little confused. I thought this issue was already voted on during the special Vocab meeting in December 2023. However, you state above you have voted on putting forth a proposal.

So, is this a proposal or has this been decided upon as the solution?
