Eliminating Persons [THEMIS WG3 - TOPIC 2]

ericaVoss · February 6, 2018, 8:01pm

One of the items that came out of the OHDSI Symposium Themis Working Group meeting was the question of whether all persons in the raw data should be brought over.

Our group has prepared a statement recommending which types of people to drop from a CDM build. We would like to propose adding this statement to the conventions for PERSON.

It is not required that all subjects from the raw data be carried over to the CDM; in fact, removing people whose records are not of high enough quality helps researchers using the CDM. Example scenarios where you might want to exclude a subject from the CDM include: unknown gender, the person lacks prescription or health benefits in a claims database, the person’s year of birth or age is unreasonable (e.g. born in 1800), or the raw data states that the person may not be of high research quality (e.g. CPRD will actually suggest which people not to use within research). Removal of a patient should be made when it makes sense in consideration of the raw data source.

@Rijnbeek has already commented, asking “what if I’m doing research where gender does not matter?” Very often gender will play a role in our analysis. For example, a Table 1 will often include gender breakouts. At Janssen we used to keep these persons in, but almost every time I built a Table 1 and found I had one person of unknown gender, I was asked to remove that person and perform the analysis again. Since they caused us more harm than analytical good, we have chosen to remove them. Additionally, unknown gender occurs very infrequently (<0.00% of the time in Truven CCAE/MDCR, <0.17% of the time in MDCD, which is slightly higher most likely due to the high birth rate and initial misclassification).

Asha_Mahesh · February 6, 2018, 8:22pm

If a field is required and raw data has unreasonable value in that field, then I agree with removal.

parisni · February 6, 2018, 8:58pm

Well, I’d say the opposite:
if the raw data has a missing or unreasonable value, then it is unreasonable to make the field required.

It makes things far more comfortable when statisticians know that no data was removed arbitrarily from the database they are working on.

Let the statisticians decide whether the data quality is good enough for their analysis.

What if the study is about “data quality”? I feel it is a dangerous practice to remove data without knowing the use cases in advance.

BTW, an unknown gender should be introduced in the CDM instead of removing all the people who do not fit that binary classifier.

ericaVoss · February 6, 2018, 10:21pm

I disagree with this. I’m not advocating for things to be arbitrarily removed. However, as someone with years of experience with the raw data and as the manager of the ETL process, if I learn from the data vendors that they recommend excluding something, it is better for that to be implemented in a repeatable, transparent, and standardized way within the CDM build than to leave each statistician to interpret how to implement a rule, or even to know whether they should be doing something in certain situations.

I do agree with you that ETLers should strive to bring as much over as possible, but if something is known to be suspect then you do your statistician a service by eliminating it in a standardized way. Additionally, the ETL document should discuss all of these coding decisions so they are clear to the user.

I want to be clear, because I know this is a sensitive issue: when I discuss “unknown gender” here I do not mean a switch in gender identification (there is a whole thread that digs more into that). If your data truthfully captures people who experience or identify with a gender change, then you should represent that, and the Vocabulary should help us represent that. However, at least in claims data, it is more likely an administrative error than someone having switched their gender. For example, we’ve found a handful of people where it looks like their IDs got reused, because they were female for a few years, disappeared, and then were suddenly male and a different age.

All these points make me think the language needs to be softened and made clearer. This is meant to be a recommendation and not a requirement. I can take another crack at it, but I am also open to input! Thank you for the discussion.

parisni · February 9, 2018, 9:53am

I totally agree. Quality management is a central issue.

We have a proposal to introduce into the CDM a general quality system that would allow tagging any row or any concept in an easy, flexible, and I hope powerful way.
I’d be glad to make it public. This would allow keeping strange rows, documenting them, and making quality a central concept in OMOP.

aostropolets · February 9, 2018, 10:42am

We can still put 0 in gender_concept_id to capture unknown gender. That way we’ll be able to analyze these patients’ data when we do not care about their gender.
“person’s year of birth or age are unreasonable (e.g. born in 1800), or raw data states that the person may not be of high research quality (e.g. CPRD will actually suggest which people not to use within research)”
Agree; in this case we would rather exclude those patients. This is quite a common issue.
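
A minimal sketch of what the "put 0 in gender_concept_id" approach could look like in an ETL step. The raw codes `"M"` and `"F"` and the helper names are hypothetical and will differ by source; 8507 and 8532 are the standard OMOP concepts for male and female.

```python
# Standard OMOP gender concepts; 0 is the conventional "no matching concept" value.
MALE_CONCEPT_ID = 8507
FEMALE_CONCEPT_ID = 8532
UNKNOWN_CONCEPT_ID = 0

# Hypothetical raw codes; a real source will have its own coding.
RAW_GENDER_MAP = {"M": MALE_CONCEPT_ID, "F": FEMALE_CONCEPT_ID}


def map_gender(raw_value):
    """Return a gender_concept_id, falling back to 0 for missing or unrecognized values."""
    if raw_value is None:
        return UNKNOWN_CONCEPT_ID
    return RAW_GENDER_MAP.get(str(raw_value).strip().upper(), UNKNOWN_CONCEPT_ID)


def with_known_gender(persons):
    """Analysis-time filter for studies that do need gender; the rows stay in the CDM."""
    return [p for p in persons if p["gender_concept_id"] != UNKNOWN_CONCEPT_ID]


persons = [
    {"person_id": 1, "gender_concept_id": map_gender("f")},
    {"person_id": 2, "gender_concept_id": map_gender(None)},
    {"person_id": 3, "gender_concept_id": map_gender("U")},
]
```

With this approach the person is kept in the build, and an analysis that stratifies by gender can call something like `with_known_gender(persons)` to drop the zero-concept rows itself.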

ericaVoss · February 9, 2018, 8:33pm

I think that until that is fleshed out we can provide conventions for the setup we currently have. Any new proposal can of course change the conventions that have been set.

Agree!

Given everyone’s feedback above, I think people agree that there are allowable scenarios for eliminating a PERSON, but suggesting that one of those reasons is “unknown gender” seems to give people some heartburn. Let me suggest an update to the verbiage:

It is not required that all subjects from the raw data be carried over to the CDM; in fact, removing people whose records are not of high enough quality may help researchers using the CDM. Example scenarios to remove subjects include: a person’s year of birth or age is unreasonable (e.g. born in 1800 or 2999), the person lacks prescription or health benefits in a claims database (i.e. you do not have a complete picture of their record), or the raw data states that the person may not be of high research quality (e.g. CPRD will actually suggest which people not to use within research). Removal of a patient is not required and should be made in consideration of the raw data source. Reasons for removal of persons should be documented in the ETL documentation and the METADATA table.

What this convention does is let people know it is okay to remove persons and give a few examples to get them thinking about when it might make sense in their own CDM. In the end, if you want to drop people in your ETL for a given reason you can, and if you want to keep everyone you can. The best place for these notes is the METADATA table (@Ajit_Londhe), but while that still matures the ETL document can also record the choices made.
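
A minimal sketch of how such optional exclusions might be applied while keeping a tally of reasons for the ETL documentation. The lower bound `MIN_BIRTH_YEAR`, the `source_flags_low_quality` field, and the function names are all hypothetical; each ETL would choose its own criteria, or none at all.

```python
from collections import Counter

# Hypothetical lower bound; each ETL would choose its own plausibility window.
MIN_BIRTH_YEAR = 1900


def removal_reason(person, build_year):
    """Return a short reason string if the person should be dropped, else None."""
    year_of_birth = person.get("year_of_birth")
    # Assumption: a missing year of birth is treated the same as an implausible one.
    if year_of_birth is None or not (MIN_BIRTH_YEAR <= year_of_birth <= build_year):
        return "unreasonable year of birth"
    # Hypothetical flag standing in for a vendor's "do not use for research" marker.
    if person.get("source_flags_low_quality", False):
        return "flagged as low research quality by source"
    return None


def apply_exclusions(persons, build_year):
    """Split persons into those kept and a tally of removal reasons for documentation."""
    kept = []
    reasons = Counter()
    for person in persons:
        reason = removal_reason(person, build_year)
        if reason is None:
            kept.append(person)
        else:
            reasons[reason] += 1
    return kept, reasons


raw_persons = [
    {"person_id": 1, "year_of_birth": 1975},
    {"person_id": 2, "year_of_birth": 1800},
    {"person_id": 3, "year_of_birth": 2999},
    {"person_id": 4, "year_of_birth": 1960, "source_flags_low_quality": True},
]
kept, reasons = apply_exclusions(raw_persons, build_year=2018)
```

Returning the reason counts alongside the kept persons means the numbers that go into the ETL document come from the same code path that performed the removal.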

Still interested in people’s thoughts . . .

jenniferduryea · February 9, 2018, 9:52pm

Just a little edit…

It is not required that all subjects from the raw data be carried over to the CDM; in fact, removing people whose records are not of high enough quality may help researchers using the CDM. Example scenarios to remove subjects include: a person’s year of birth or age is unreasonable (e.g. born in 1800 or 2999), the person lacks health benefits in a claims database (i.e. you do not have a complete picture of their record), or the raw data states that the person may not be of high research quality (e.g. CPRD will actually suggest which people not to use within research). Removal of a patient is not required and should be made in consideration of the raw data source. Reasons for removal of persons should be documented in the ETL documentation and the METADATA table.

Types of health coverage are being discussed in WG2.

Christian_Reich · February 12, 2018, 1:26pm

I understand where you are coming from, Nicolas. But we explicitly want to get away from kicking the can down the road to the analyst or statistician, because that got us into the mess we have now: they have to make decisions based on incomplete information, non-transparently, over and over again, spending a ton of time on these data issues instead of making statistical calculations. And it never gets solved. This is what we are trying to fix in THEMIS. Data quality should come before data analysis.

Makes sense?

Vojtech_Huser · February 13, 2018, 3:34pm

I am proposing to add one more sentence that clarifies how an analyst can fetch the removed person count:

Reasons for removal of persons should be documented in the ETL documentation and METADATA table (insert a row in METADATA where metadata.name='count of removed persons' and metadata.value_as_string='xyz', where xyz is a number, e.g. 12).

To keep METADATA simple, we don’t have a numeric value column and instead overload the string value (just as the strata in the Achilles results tables do).
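
A minimal sketch of writing and reading that row, using an in-memory SQLite database. The two-column `metadata` table here is a simplified stand-in containing only the `name` and `value_as_string` columns mentioned above, and the helper functions are hypothetical.

```python
import sqlite3

REMOVED_PERSONS_NAME = "count of removed persons"


def record_removed_person_count(conn, removed_count):
    """Store the count as a string, since the value is kept in value_as_string."""
    conn.execute(
        "INSERT INTO metadata (name, value_as_string) VALUES (?, ?)",
        (REMOVED_PERSONS_NAME, str(removed_count)),
    )
    conn.commit()


def fetch_removed_person_count(conn):
    """Return the recorded count as an int, or None if the ETL did not record one."""
    row = conn.execute(
        "SELECT value_as_string FROM metadata WHERE name = ?",
        (REMOVED_PERSONS_NAME,),
    ).fetchone()
    return int(row[0]) if row is not None else None


# Simplified stand-in for the METADATA table; only the two columns discussed here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadata (name TEXT, value_as_string TEXT)")
record_removed_person_count(conn, 12)
removed = fetch_removed_person_count(conn)
```

Because the number is stored as text, the reader has to cast it back, which is the cost of overloading the string column.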

Rijnbeek · February 13, 2018, 3:51pm

@ericaVoss I like the current text: it is only a suggestion, not a convention to always remove.

The whole point is to be transparent in your ETL document about the choices made. My personal preference is to clean as much as possible so that analysts do not all have to repeat these steps. Most of these issues will be rare anyway.

parisni · February 13, 2018, 3:56pm

Indeed. However, I’d say that data quality does not necessarily mean removing data. It is much more about qualifying the data. At least this is the main idea behind the data quality proposal I mentioned earlier, because data quality is contextual.

For both examples in this thread (persons without gender information and persons with an inconsistent birthdate), our local EHR practice makes me think that the data should not be removed at all.

Persons without gender are transgender people in our database.
Persons with an inconsistent birthdate are, for the most part, homeless people.
So should we remove those rows from the CDM? This is where a data quality framework comes in, which is much more relevant than arbitrary and contextual deletion.

quality_concept_id fields in all tables would allow flexible, living tracking of database inconsistencies and let statisticians be comfortable with both the analysis and their understanding of the database.

ericaVoss · February 16, 2018, 3:13am

I was on WG4 today and there were items there too that relate to this post. Consolidating feedback from this post and that group:

It is not required that all subjects from the raw data be carried over to the CDM; in fact, removing people whose records are not of high enough quality may help researchers using the CDM. Example scenarios to remove subjects include: a person’s year of birth or age is unreasonable (e.g. born in year 0, 1800, 2999 or just lacking a year of birth), the person lacks health benefits in a claims database (i.e. you do not have a complete picture of their record), or the raw data states that the person may not be of high research quality (e.g. CPRD will actually suggest which people not to use within research). Removal of a patient is not required and should be made in consideration of the raw data source. Reasons for removal of persons should be documented in the ETL documentation and METADATA table (insert a row in METADATA where metadata.name='count of removed persons' and metadata.value_as_string='xyz', where xyz is a number, e.g. 12).

Please continue to provide feedback.

Christian_Reich · February 24, 2018, 3:25am

@parisni:

I cannot agree more.

Only difference: You want to keep the context open, I want to nail it. At least to the degree necessary so you can facilitate remote studies. Meaning, you can send analytic code to someone’s data and the result will be correct within that context. And instead of “context” we call it “THEMIS convention”.

Understood, but unless you are studying transgender people, this is such a small proportion of people that it doesn’t have much influence on the vast majority of use cases: the characterization of populations by gender, or the adjustment for confounding by gender.

Look: the convention is not telling people what to do, only what the context is supposed to be, that is, how the data should behave to be interpretable or analyzable. It is the job of the ETLer to make the best possible guess at a birthday. And again: unless we are studying babies, the older people get, the less precision we need for this.

That is an interesting idea. Please bring it up in the CDM WG, which introduces changes to the CDM based on use cases. THEMIS, on the other hand, is supposed to streamline the conventions of the existing fields and tables.

Christian_Reich · February 24, 2018, 3:27am

I like the general wording of it. But we should still consider using stronger language for “Jesusses” or birthdays otherwise incompatible with life. If we know the record is wrong, what’s the use case for keeping it?

Gowtham_Rao · February 24, 2018, 4:13am

What about persons who have health insurance coverage but no claims? I think they should be retained; they will appear in the person, observation period, and payer plan period tables but may have zero visit occurrence records. They contribute to exposure and health economic analyses.

ericaVoss · February 26, 2018, 4:22pm

@Gowtham_Rao good point, that could be added. I agree with you, if someone has coverage but seeks no care they should still be included in the database. Actually maybe it could be its own rule.

RECOMMENDATION #1
It is not required that all subjects from the raw data be carried over to the CDM; in fact, removing people whose records are not of high enough quality may help researchers using the CDM. Example scenarios to remove subjects include: a person’s year of birth or age is unreasonable (e.g. born in year 0, 1800, 2999 or just lacking a year of birth), the person lacks health benefits in a claims database (i.e. you do not have a complete picture of their record), or the raw data states that the person may not be of high research quality (e.g. CPRD will actually suggest which people not to use within research). Removal of a patient is not required and should be made in consideration of the raw data source. Reasons for removal of persons should be documented in the ETL documentation and METADATA table (insert a row in METADATA where metadata.name='count of removed persons' and metadata.value_as_string='xyz', where xyz is a number, e.g. 12).

RULE #1
An ETL should not delete persons who contribute time but have no health care utilization (e.g. an individual who is enrolled in insurance but does not visit a doctor or pharmacy). Such an individual still contributes to analysis as a healthy / non-care-seeking individual.
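
A small worked example of why this rule matters, using made-up numbers: dropping enrollees with zero visits removes their person-time from the denominator and inflates a utilization rate. The field names `enrolled_years` and `visit_count` are hypothetical.

```python
def visits_per_person_year(persons):
    """Crude utilization rate: total visits divided by total enrolled person-years."""
    total_years = sum(p["enrolled_years"] for p in persons)
    total_visits = sum(p["visit_count"] for p in persons)
    return total_visits / total_years


# Illustrative data only: person 2 is enrolled but never seeks care.
cohort = [
    {"person_id": 1, "enrolled_years": 2.0, "visit_count": 4},
    {"person_id": 2, "enrolled_years": 2.0, "visit_count": 0},
]

# Keeping everyone who contributes time: 4 visits over 4 person-years.
rate_all_enrolled = visits_per_person_year(cohort)

# Wrongly dropping the non-care-seeking person: 4 visits over 2 person-years.
care_seekers_only = [p for p in cohort if p["visit_count"] > 0]
rate_care_seekers = visits_per_person_year(care_seekers_only)
```

In this toy cohort the rate doubles from 1.0 to 2.0 visits per person-year once the enrolled non-user is removed, even though nothing about actual utilization changed.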

Christian_Reich · February 26, 2018, 4:55pm

Sounds like “persons doing time”.

ericaVoss · March 5, 2018, 6:56pm

THEMIS WG3 is bringing this up at the THEMIS F2F Day 1 this week.

RECOMMENDATION
It is not required that all subjects from the raw data be carried over to the CDM; in fact, removing people whose records are not of high enough quality may help researchers using the CDM. Example scenarios to remove subjects include: a person’s year of birth or age is unreasonable (e.g. born in year 0, 1800, 2999 or just lacking a year of birth), the person lacks health benefits in a claims database (i.e. you do not have a complete picture of their record), or the raw data states that the person may not be of high research quality (e.g. CPRD will actually suggest which people not to use within research). Removal of a patient is not required and should be made in consideration of the raw data source. Reasons for removal of persons should be documented in the ETL documentation and METADATA table (insert a row in METADATA where metadata.name='count of removed persons' and metadata.value_as_string='xyz', where xyz is a number, e.g. 12).

RULE
An ETL should not delete persons who contribute time but have no health care utilization (e.g. an individual who is enrolled in insurance but does not visit a doctor or pharmacy). Such an individual still contributes to analysis as a healthy / non-care-seeking individual.

ACTION
Work with @clairblacketer to have this posted on the PERSON page of the CDM Wiki, under conventions. https://github.com/OHDSI/CommonDataModel/wiki/PERSON

clairblacketer · March 6, 2018, 1:54pm

@ericaVoss After the THEMIS F2F I would like to add all resulting recommendations to the CDM documentation. I’ll put this one at the top of the list.