
Eliminating Persons [THEMIS WG3 - TOPIC 2]

I think that until that is fleshed out we can provide conventions for the setup we currently have. Any new proposal can of course change the conventions that were set.


Agree!


Given everyone’s feedback above, I think people agree that there are allowable scenarios for eliminating a PERSON, but suggesting that one of those reasons is “unknown gender” seems to give people some heartburn. Let me suggest an update to the verbiage:

It is not required that all subjects from the raw data be carried over to the CDM, in fact removing people that are not of high enough quality may help researchers using the CDM. Example scenarios to remove subjects include: a person’s year of birth or age are unreasonable (e.g. born in 1800 or 2999), person lacks prescription or health benefits in claims database (i.e. thus you do not have a complete picture of their record), or raw data states that the person may not be of high research quality (e.g. CPRD will actually suggest which people not to use within research). Removal of a patient is not required and should be made in consideration of the raw data source. Reasons for removal of persons should be documented in the ETL documentation and METADATA table.

What this convention does is let people know that it is okay to remove persons, and it gives a few examples to get them thinking about when that might make sense in their own CDM. In the end, if you want to drop people in your ETL for a given reason you can, and if you want to keep everyone you can. The best place for these notes is the METADATA table (@Ajit_Londhe :sunglasses: ), but while that still matures the ETL document can also record the choices made.

Still interested in people’s thoughts . . .

Just a little edit…

It is not required that all subjects from the raw data be carried over to the CDM, in fact removing people that are not of high enough quality may help researchers using the CDM. Example scenarios to remove subjects include: a person’s year of birth or age are unreasonable (e.g. born in 1800 or 2999), person lacks health benefits in claims database (i.e. thus you do not have a complete picture of their record), or raw data states that the person may not be of high research quality (e.g. CPRD will actually suggest which people not to use within research). Removal of a patient is not required and should be made in consideration of the raw data source. Reasons for removal of persons should be documented in the ETL documentation and METADATA table.

Types of health coverage are being discussed in WG2 :smile:


I understand where you are coming from, Nicolas. But we explicitly want to get away from kicking the can down the road to the analyst or statistician, because that got us into the mess we have now: they have to make decisions based on lacking information, non-transparently, over and over again, spending a ton of time on these data issues instead of making statistical calculations. And it never gets solved. Fixing that is what we are trying to achieve in THEMIS. Data quality should come before data analysis.

Makes sense?


I am proposing to add one more sentence that clarifies how an analyst can fetch the removed person count:

Reasons for removal of persons should be documented in the ETL documentation and METADATA table (insert a row in METADATA where metadata.name='count of removed persons' and metadata.value_as_string='xyz', where xyz is a number, e.g., 12).

To keep METADATA simple, we don’t have a numeric value and instead overload the string value (just as the strata in the Achilles results tables do).
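
A minimal sketch of that idea in Python, using an in-memory SQLite database for illustration only. The two-column `metadata` table is a simplification (assumption) of the real METADATA table, and the helper names `record_removed_person_count` and `fetch_removed_person_count` are hypothetical:

```python
import sqlite3

REMOVED_COUNT_NAME = "count of removed persons"


def record_removed_person_count(conn, removed_count):
    """Store the count as a string, since the value column is overloaded."""
    conn.execute(
        "INSERT INTO metadata (name, value_as_string) VALUES (?, ?)",
        (REMOVED_COUNT_NAME, str(removed_count)),
    )


def fetch_removed_person_count(conn):
    """Read the count back and convert it to an integer for the analyst."""
    row = conn.execute(
        "SELECT value_as_string FROM metadata WHERE name = ?",
        (REMOVED_COUNT_NAME,),
    ).fetchone()
    return int(row[0]) if row else None


conn = sqlite3.connect(":memory:")
# Assumption: only the two columns discussed here; the real table has more.
conn.execute("CREATE TABLE metadata (name TEXT, value_as_string TEXT)")
record_removed_person_count(conn, 12)
removed = fetch_removed_person_count(conn)
```

The analyst has to cast the string back to a number, which is the price of keeping a single value column.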

@ericaVoss I like the current text: it is only a suggestion, not a convention to always remove.

The whole point is to be transparent in your ETL document about the choices made. My personal preference is to clean as much as possible, so that analysts do not all have to repeat these steps. Most of these issues will be rare anyway.


Indeed. However, I’d say that data quality is not necessarily about removing data. It’s much more about qualifying the data. At least this is the main idea behind the data quality proposal I mentioned earlier, because data quality is contextual.

For both examples in this thread (persons without gender information and persons with an inconsistent birthdate), our local EHR practice here makes me think that the data should not be removed at all.

  • Persons without gender are transgender people in our database.
  • Persons with an inconsistent birthdate are, for the most part, homeless people.

So should we remove those rows from the CDM? This is where a data quality framework comes in, which is much more relevant than arbitrary and contextual deletion.

quality_concept_id fields in all tables would allow flexible, living tracking of database inconsistencies and let statisticians be comfortable with both the analysis and their understanding of the database.

I was on WG4 today and there were items there too that relate to this post. Consolidating feedback from this post and that group:

It is not required that all subjects from the raw data be carried over to the CDM, in fact removing people that are not of high enough quality may help researchers using the CDM. Example scenarios to remove subjects include: a person’s year of birth or age are unreasonable (e.g. born in year 0, 1800, 2999 or just lacking a year of birth), person lacks health benefits in claims database (i.e. thus you do not have a complete picture of their record), or raw data states that the person may not be of high research quality (e.g. CPRD will actually suggest which people not to use within research). Removal of a patient is not required and should be made in consideration of the raw data source. Reasons for removal of persons should be documented in the ETL documentation and METADATA table (insert row in METADATA where metadata.name='count of removed persons' and metadata.value_as_string='xyz' where xyz is a number (e.g., 12)).
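
To make the example scenarios concrete, here is a rough Python sketch of how an ETL might apply them. It is only an illustration: the `MAX_PLAUSIBLE_AGE` threshold, the `has_health_benefits` field, the reason labels, and the function names are all assumptions, and each data source would need its own rules.

```python
from collections import Counter

MAX_PLAUSIBLE_AGE = 120  # assumption: oldest age this ETL accepts as plausible


def removal_reason(person, current_year):
    """Return why a person would be removed, or None to keep them."""
    year_of_birth = person.get("year_of_birth")
    if year_of_birth is None:
        return "missing year of birth"
    if year_of_birth > current_year:
        return "year of birth in the future"
    if current_year - year_of_birth > MAX_PLAUSIBLE_AGE:
        return "implausibly old"
    # Hypothetical flag; persons without it are treated as covered.
    if not person.get("has_health_benefits", True):
        return "no health benefits"
    return None


def split_persons(persons, current_year):
    """Separate kept persons from removed ones, counting removals by reason."""
    kept, reasons = [], Counter()
    for person in persons:
        reason = removal_reason(person, current_year)
        if reason is None:
            kept.append(person)
        else:
            reasons[reason] += 1
    return kept, reasons


raw_persons = [
    {"person_id": 1, "year_of_birth": 1975},
    {"person_id": 2, "year_of_birth": 0},
    {"person_id": 3, "year_of_birth": 1800},
    {"person_id": 4, "year_of_birth": 2999},
    {"person_id": 5, "year_of_birth": None},
    {"person_id": 6, "year_of_birth": 1980, "has_health_benefits": False},
]
kept, reasons = split_persons(raw_persons, current_year=2018)
```

Here `sum(reasons.values())` would be the number written to the METADATA row, and the per-reason counts could go into the ETL documentation.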

Please continue to provide feedback


@parisni:

I cannot agree more.

Only difference: You want to keep the context open, I want to nail it. At least to the degree necessary so you can facilitate remote studies. Meaning, you can send analytic code to someone’s data and the result will be correct within that context. And instead of “context” we call it “THEMIS convention”.

Understood, but unless you are studying transgender people this is really such a small proportion of people that it doesn’t have much influence over the vast majority of use cases: the characterization of populations by gender or the adjustment for confounding by gender.

Look: the convention is not telling people what to do. Only what the context is supposed to be. How the data should behave to be interpretable or analyzable. It is the job of the ETLer to create the best shot at guessing a birthday. And again: Unless we are studying babies the older we get the less precision we need for this.

That is an interesting idea. Please bring it up in the CDM WG, which introduces changes to the CDM based on use cases. THEMIS, on the other hand, is supposed to streamline the conventions of the existing fields and tables.

I like the general wording of it. But we should still consider stronger language for “Jesuses” or birthdays otherwise incompatible with life. If we know the record is wrong, what’s the use case for keeping it?

How about persons who have health insurance coverage but no claim? I think they should be retained, and will belong to person, observation period, payer plan period tables, but may have zero visit occurrence records. They contribute to the exposure and health economic analysis.

@Gowtham_Rao good point, that could be added. I agree with you, if someone has coverage but seeks no care they should still be included in the database. Actually maybe it could be its own rule.

RECOMMENDATION #1
It is not required that all subjects from the raw data be carried over to the CDM, in fact removing people that are not of high enough quality may help researchers using the CDM. Example scenarios to remove subjects include: a person’s year of birth or age are unreasonable (e.g. born in year 0, 1800, 2999 or just lacking a year of birth), person lacks health benefits in claims database (i.e. thus you do not have a complete picture of their record), or raw data states that the person may not be of high research quality (e.g. CPRD will actually suggest which people not to use within research). Removal of a patient is not required and should be made in consideration of the raw data source. Reasons for removal of persons should be documented in the ETL documentation and METADATA table (insert row in METADATA where metadata.name='count of removed persons' and metadata.value_as_string='xyz' where xyz is a number (e.g., 12)).

RULE #1
An ETL should not delete persons who contribute time but have no health care utilization (e.g. an individual enrolled in insurance who does not visit a doctor or pharmacy). This individual will still contribute to analysis as a healthy / non-care-seeking individual.
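
A small Python sketch of what this rule could mean in an ETL, for illustration only. The function name `select_persons` and the dictionary-based rows are hypothetical, and dropping persons with no observation time at all is an assumption borrowed from the recommendation above, not part of the rule:

```python
def select_persons(person_ids, observation_periods, visit_occurrences):
    """Keep everyone who contributes time; utilization is not required."""
    with_time = {row["person_id"] for row in observation_periods}
    with_visits = {row["person_id"] for row in visit_occurrences}
    # Membership in visit_occurrences plays no part in the keep decision.
    kept = [pid for pid in person_ids if pid in with_time]
    # Kept persons with zero visits: healthy / non-care-seeking individuals.
    no_utilization = [pid for pid in kept if pid not in with_visits]
    return kept, no_utilization


person_ids = [1, 2, 3]
observation_periods = [{"person_id": 1}, {"person_id": 2}]  # enrolled persons
visit_occurrences = [{"person_id": 1}]  # only person 1 sought care
kept, no_utilization = select_persons(
    person_ids, observation_periods, visit_occurrences
)
```

Person 2 is enrolled but never sought care, so they stay in the CDM and still contribute time to analyses.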


Sounds like “persons doing time”. :smile:

THEMIS WG3 is bringing this up at the THEMIS F2F Day 1 this week.


RECOMMENDATION
It is not required that all subjects from the raw data be carried over to the CDM, in fact removing people that are not of high enough quality may help researchers using the CDM. Example scenarios to remove subjects include: a person’s year of birth or age are unreasonable (e.g. born in year 0, 1800, 2999 or just lacking a year of birth), person lacks health benefits in claims database (i.e. thus you do not have a complete picture of their record), or raw data states that the person may not be of high research quality (e.g. CPRD will actually suggest which people not to use within research). Removal of a patient is not required and should be made in consideration of the raw data source. Reasons for removal of persons should be documented in the ETL documentation and METADATA table (insert row in METADATA where metadata.name='count of removed persons' and metadata.value_as_string='xyz' where xyz is a number (e.g., 12)).

RULE
An ETL should not delete persons who contribute time but have no health care utilization (e.g. an individual enrolled in insurance who does not visit a doctor or pharmacy). This individual will still contribute to analysis as a healthy / non-care-seeking individual.

ACTION
Work with @clairblacketer to have this posted on the PERSON page of the CDM Wiki under conventions. https://github.com/OHDSI/CommonDataModel/wiki/PERSON

@ericaVoss After the THEMIS F2F I would like to add all resulting recommendations to the CDM documentation. I’ll put this one on the top of the list


Any final feedback?

I’m working on bringing all Estonian EHR documents (yes!) into the OMOP model, and the same issue strikes…

In short, for approx. 1% of cases the patient’s birth year is not recorded in our EHR, as recording it is not mandatory.
Unfortunately, birth year is mandatory in OMOP. What concerns me most is the recommendation “For data sources where the year of birth is not available, the approximate year of birth is derived based on any age group categorization available.” (https://github.com/OHDSI/CommonDataModel/wiki/PERSON)
I think this suggestion was made for data sources whose structure lacks birth years completely, but making assumptions about the birth year is not good practice anyway, I think.

However, a missing birth year does not necessarily indicate poor data quality. For several research questions (e.g. estimating healthcare costs for a nation) it is not a necessary parameter. The same goes for gender. I’d prefer a missing birth year to an “estimated year of birth”.

Thus, from my point of view, data quality is always something that depends on the question we are trying to answer. Sometimes it is important to have the birth year information, but sometimes it is more important to have complete data (of the nation) with the costs, even without the exact birth years/genders.

I would be very concerned about dropping 1% of the data just because the birth year is a mandatory field. Any suggestions to solve this?


Just a clarification. There seem to be two different questions in this thread: “possibly incorrect value” and “missing value”. I agree that bringing over (clearly) incorrect values should be avoided. However, instead of dropping the records, I would suggest turning these values into “unknown” and thereby allowing missing values, even for birth year.

Yeah, I think this initial statement is old and maybe should be updated or removed: “For data sources where the year of birth is not available, the approximate year of birth is derived based on any age group categorization available.”

Later we do say “Example scenarios to remove subjects include: a person’s year of birth or age are unreasonable (e.g. born in year 0, 1800, 2999 or just lacking a year of birth), person lacks health benefits in claims database (i.e. thus you do not have a complete picture of their record), or raw data states that the person may not be of high research quality (e.g. CPRD will actually suggest which people not to use within research). Removal of a patient is not required and should be made in consideration of the raw data source.”

However, with all of that said, you have to think of your use cases. For us, most analyses would include an age component, so I would probably just eliminate those folks.

I am glad you are bringing this up, because it is often a source of concern and anxiety. The use cases we generally support are observational research on longitudinal patient records. In other words, what happens to a group of patients over time. These types of analytics almost always require the correct age, otherwise interpretation of the results is nearly impossible. As a community, we decided to make at least the age in years mandatory, because having to exclude missing birth years in every query is an enormous extra burden.

Healthcare cost for the nation is what we call “cross-sectional” research. It doesn’t care about individual patients, it takes a snapshot of everything that happens at a point in time. Another example is total amount of drug used in the nation, etc. The OMOP CDM is not designed around supporting these. And in general very few databases have nation-wide patient-level data or allow what’s called “projection” of records to the national level. Our data tend to be sampling some distinct population (hospital, insurance company, etc.)

This reflex I really don’t understand. We are not hoarders. We use the data to make inferences based on them. And if almost every analysis that is not cross-sectional will have to exclude these patients anyway, you don’t gain anything by holding on to them.

BTW, that is awesome, @Sulev_Reisberg.
