One of the questions that came out of the OHDSI Symposium THEMIS Working Group meeting was: should all persons in the raw data be brought over?
Our group has prepared a statement recommending what types of people to drop from a CDM build. We would like to propose adding this statement to the conventions for PERSON.
It is not required that all subjects from the raw data be carried over to the CDM; in fact, removing people that are not of high enough quality helps researchers using the CDM. Example scenarios where you might want to exclude a subject from the CDM include: unknown gender, the person lacks prescription or health benefits in a claims database, the person's year of birth or age is unreasonable (e.g. born in 1800), or the raw data states that the person may not be of high research quality (e.g. CPRD will actually suggest which people not to use within research). Removal of a patient should be made when it makes sense in consideration of the raw data source.
@Rijnbeek has already commented about "what if I'm doing research where gender does not matter". Very often gender will play a role in our analysis. For example, a Table 1 will often include gender breakouts. At Janssen we used to keep them in, but almost every time I built my Table 1 and found I had 1 unknown gender person, I was asked to remove them and perform the analysis again. Since they caused us more harm than analytical good, we have chosen to remove them. Additionally, unknown gender occurs very infrequently (<0.00% of the time in Truven CCAE/MDCR, <0.17% of the time in MDCD, which is slightly higher most likely due to the high birth rate and initial misclassification).
I disagree with this. I'm not advocating for things to be arbitrarily removed. However, as someone with years of experience with the raw data and as the manager of the ETL process, if I learn that the data vendors recommend excluding something, it is better to implement that in a repeatable, transparent, and standardized way within the CDM build than to leave each statistician to interpret how to implement a rule, or even to know whether they should be doing something in certain situations.
I do agree with you that ETLers should strive to bring as much over as possible, but if something is known to be suspect then you do your statistician a service by eliminating it in a standardized way. Additionally, the ETL document should discuss all of these coding decisions so they are clear to the user.
I want to be clear because I know this is a sensitive issue: when I'm discussing "unknown gender" here I do not mean to be discussing a switch in gender identification (there is a whole thread that digs more into that). If your data truthfully is able to capture people who experience/identify with a gender change then you should represent that, and the Vocabulary should help us represent that. However, at least in claims data, it is more likely administrative error and not because someone has switched their gender. For example, we've found a handful of people where it looks like their IDs got reused because they were female for a few years, disappeared, and were suddenly male and a different age.
All these points make me think the language needs to be softened/made clearer. This is meant to be a recommendation and not a requirement. I can take another crack at it but am also open to input! Thank you for the discussion.
I totally agree. Quality management is a central issue.
We have a proposal to introduce to the CDM a general quality system that would allow tagging any row or any concept in an easy, flexible, and I hope powerful way.
I'd be glad to make it public. This would allow keeping strange rows, documenting them, and making quality a central concept in OMOP.
We can still put 0 in gender_concept_id to capture unknown gender. In this way we'll be able to analyze these patients' data if we do not care about their gender.
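As a rough illustration of the idea above, here is a minimal sketch of mapping a raw gender code to gender_concept_id with 0 as the fallback. The source codes "M" and "F" and the function name `map_gender` are assumptions made for illustration; 8507 and 8532 are the standard OMOP concepts for male and female.

```python
# Standard OMOP gender concepts; the raw codes "M"/"F" are assumed for illustration.
GENDER_CONCEPTS = {"M": 8507, "F": 8532}


def map_gender(source_value):
    """Return a gender_concept_id, falling back to 0 for unknown or missing values."""
    if source_value is None:
        return 0
    return GENDER_CONCEPTS.get(source_value.strip().upper(), 0)
```

Persons mapped to 0 stay in the CDM, and an analysis that does need gender can filter on `gender_concept_id != 0`.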
"person's year of birth or age are unreasonable (e.g. born in 1800), or raw data states that the person may not be of high research quality (e.g. CPRD will actually suggest which people not to use within research)"
Agree, in this case we would rather exclude those patients. Quite a common issue.
I think until that is fleshed out we can provide conventions for the setup we currently have. Any new proposal can of course change the conventions that were set.
Agree!
Given everyone's feedback above, I think people agree that there are allowable scenarios to eliminate a PERSON, but suggesting that one of those reasons is "unknown gender" seems to give people some heartburn. Let me suggest an update to the verbiage:
It is not required that all subjects from the raw data be carried over to the CDM; in fact, removing people that are not of high enough quality may help researchers using the CDM. Example scenarios to remove subjects include: a person's year of birth or age is unreasonable (e.g. born in 1800 or 2999), the person lacks prescription or health benefits in a claims database (i.e. you do not have a complete picture of their record), or the raw data states that the person may not be of high research quality (e.g. CPRD will actually suggest which people not to use within research). Removal of a patient is not required and should be made in consideration of the raw data source. Reasons for removal of persons should be documented in the ETL documentation and the METADATA table.
What this convention does is let people know it is okay to remove persons, and it gives a few examples to get them thinking about when it might make sense in their own CDM. In the end, in your ETL, if you want to drop people for a given reason you can; if you want to keep everyone you can. The best place for these notes is the METADATA table (@Ajit_Londhe), but while that still matures the ETL document can also document the choices made.
It is not required that all subjects from the raw data be carried over to the CDM; in fact, removing people that are not of high enough quality may help researchers using the CDM. Example scenarios to remove subjects include: a person's year of birth or age is unreasonable (e.g. born in 1800 or 2999), the person lacks health benefits in a claims database (i.e. you do not have a complete picture of their record), or the raw data states that the person may not be of high research quality (e.g. CPRD will actually suggest which people not to use within research). Removal of a patient is not required and should be made in consideration of the raw data source. Reasons for removal of persons should be documented in the ETL documentation and the METADATA table.
Types of health coverage are being discussed in WG2
I understand where you are coming from, Nicolas. But we want to explicitly get away from kicking the can down to the analyst or statistician, because that got us into the mess we have now: they have to make decisions based on incomplete information, non-transparently, over and over again, spending a ton of time on these data issues instead of making statistical calculations. And it never gets solved. This is what we are trying to achieve in THEMIS. Data quality should come before data analysis.
I am proposing to add one more sentence that clarifies how an analyst can fetch the removed person count:
Reasons for removal of persons should be documented in the ETL documentation and the METADATA table (insert a row in METADATA where metadata.name='count of removed persons' and metadata.value_as_string='xyz', where xyz is a number, e.g. 12).
To keep METADATA simple, we don't have a numeric value and instead overload the string value (just like the strata in the Achilles results tables do).
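To make the proposal concrete, here is a minimal sketch of writing and reading that row, using Python's built-in sqlite3 module. The two-column `metadata` table is a simplified stand-in that contains only the fields named above, and the helper names `record_removed_persons` and `fetch_removed_persons` are hypothetical.

```python
import sqlite3

ROW_NAME = "count of removed persons"


def record_removed_persons(conn, removed_count):
    """Store the count as a string, since the numeric value is overloaded onto value_as_string."""
    conn.execute(
        "INSERT INTO metadata (name, value_as_string) VALUES (?, ?)",
        (ROW_NAME, str(removed_count)),
    )


def fetch_removed_persons(conn):
    """Return the removed person count as an int, or None if no row was recorded."""
    row = conn.execute(
        "SELECT value_as_string FROM metadata WHERE name = ?", (ROW_NAME,)
    ).fetchone()
    return int(row[0]) if row else None


conn = sqlite3.connect(":memory:")
# Simplified stand-in for METADATA: only the two columns named in this thread.
conn.execute("CREATE TABLE metadata (name TEXT, value_as_string TEXT)")
record_removed_persons(conn, 12)
print(fetch_removed_persons(conn))
```

The analyst has to cast the string back to a number, which is the price of keeping the table simple.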
@ericaVoss I like the current text; it is only a suggestion, not a convention to always remove.
The whole point is to be transparent in your ETL document about the choices made. My personal preference is to clean as much as possible so that analysts do not all have to repeat these steps. Most of these issues will be rare anyway.
Indeed. However, I'd say that data quality is not necessarily removing data. It's much more about qualifying the data. At least this is the main idea behind the data quality proposal I mentioned earlier, because data quality is contextual.
On both examples in this thread (persons without gender information and persons with an inconsistent birthdate), our local EHR practice here makes me think that data should not be removed at all.
Persons without gender are transgender people in our database.
Persons with an inconsistent birthdate are, for the most part, homeless people.
Then should we remove those rows from the CDM? This is where a data quality framework comes in, which is much more relevant than arbitrary and contextual deletion.
quality_concept_id fields in all tables would allow flexible, live tracking of database inconsistencies and leave statisticians comfortable with both the analysis and their understanding of the database.
I was on WG4 today and there were items there too that relate to this post. Consolidating feedback from this post and that group:
It is not required that all subjects from the raw data be carried over to the CDM; in fact, removing people that are not of high enough quality may help researchers using the CDM. Example scenarios to remove subjects include: a person's year of birth or age is unreasonable (e.g. born in year 0, 1800, 2999 or just lacking a year of birth), the person lacks health benefits in a claims database (i.e. you do not have a complete picture of their record), or the raw data states that the person may not be of high research quality (e.g. CPRD will actually suggest which people not to use within research). Removal of a patient is not required and should be made in consideration of the raw data source. Reasons for removal of persons should be documented in the ETL documentation and the METADATA table (insert a row in METADATA where metadata.name='count of removed persons' and metadata.value_as_string='xyz', where xyz is a number, e.g. 12).
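The example scenarios in this statement could be applied in an ETL roughly as follows. This is only a sketch under stated assumptions: the plausibility window for year of birth, the dictionary keys `has_health_benefits` and `vendor_flagged`, and the function names are all hypothetical and would need to be adapted to the raw data source.

```python
from collections import Counter

# Assumed plausibility window for year_of_birth; tune it to the data source.
MIN_BIRTH_YEAR = 1900
MAX_BIRTH_YEAR = 2025


def removal_reason(person):
    """Return a reason string if the person matches a removal scenario, else None."""
    year = person.get("year_of_birth")
    if year is None or not (MIN_BIRTH_YEAR <= year <= MAX_BIRTH_YEAR):
        return "unreasonable or missing year of birth"
    if not person.get("has_health_benefits", True):
        return "no health benefits"
    if person.get("vendor_flagged", False):
        return "flagged by data vendor"
    return None


def split_persons(persons):
    """Separate kept persons from removed ones and tally the removals by reason."""
    kept, removed = [], Counter()
    for person in persons:
        reason = removal_reason(person)
        if reason is None:
            kept.append(person)
        else:
            removed[reason] += 1
    return kept, removed
```

The per-reason tallies are what would go into the ETL documentation, and their total is the number to store in METADATA.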
Only difference: you want to keep the context open, I want to nail it down, at least to the degree necessary to facilitate remote studies. Meaning, you can send analytic code to someone's data and the result will be correct within that context. And instead of "context" we call it "THEMIS convention".
Understood, but unless you are studying transgender people this is really such a small proportion of people that it doesn't have much influence over the vast majority of use cases: the characterization of populations by gender or the adjustment for confounding by gender.
Look: the convention is not telling people what to do. Only what the context is supposed to be. How the data should behave to be interpretable or analyzable. It is the job of the ETLer to create the best shot at guessing a birthday. And again: Unless we are studying babies the older we get the less precision we need for this.
That is an interesting idea. Please bring it up in the CDM WG, which introduces changes to the CDM based on use cases. THEMIS, on the other hand, is supposed to streamline the conventions of the existing fields and tables.
I like the general wording of it. But we should still consider putting in harder language for "Jesuses" or birthdays otherwise incompatible with life. If we know the record is wrong, what's the use case for keeping it?
How about persons who have health insurance coverage but no claims? I think they should be retained. They will belong in the person, observation period, and payer plan period tables, but may have zero visit occurrence records. They contribute to exposure and health economic analyses.
@Gowtham_Rao good point, that could be added. I agree with you, if someone has coverage but seeks no care they should still be included in the database. Actually maybe it could be its own rule.
RECOMMENDATION #1 It is not required that all subjects from the raw data be carried over to the CDM; in fact, removing people that are not of high enough quality may help researchers using the CDM. Example scenarios to remove subjects include: a person's year of birth or age is unreasonable (e.g. born in year 0, 1800, 2999 or just lacking a year of birth), the person lacks health benefits in a claims database (i.e. you do not have a complete picture of their record), or the raw data states that the person may not be of high research quality (e.g. CPRD will actually suggest which people not to use within research). Removal of a patient is not required and should be made in consideration of the raw data source. Reasons for removal of persons should be documented in the ETL documentation and the METADATA table (insert a row in METADATA where metadata.name='count of removed persons' and metadata.value_as_string='xyz', where xyz is a number, e.g. 12).
RULE #1 An ETL should not delete persons who contribute time but have no health care utilization (e.g. an individual who is enrolled in insurance but does not visit a doctor or pharmacy). This individual will still contribute to the analysis as a healthy / non-care-seeking individual.
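A minimal sketch of what this rule implies for a retention check, assuming persons, observation periods, and visits are available as simple lists. The function name `keep_enrolled_persons` and the dictionary layout are hypothetical; the point is only that enrollment decides retention and the visit count never does.

```python
from collections import Counter


def keep_enrolled_persons(person_ids, observation_periods, visit_occurrences):
    """Map each retained person_id to its visit count; zero visits is allowed."""
    enrolled = {row["person_id"] for row in observation_periods}
    visits = Counter(row["person_id"] for row in visit_occurrences)
    # Retention depends only on contributed time, never on the number of visits.
    return {pid: visits[pid] for pid in person_ids if pid in enrolled}


result = keep_enrolled_persons(
    person_ids=[1, 2, 3],
    observation_periods=[{"person_id": 1}, {"person_id": 2}],
    visit_occurrences=[{"person_id": 1}],
)
print(result)  # {1: 1, 2: 0}
```

Person 2 is kept with zero visits because they contribute enrolled time. Person 3 has no observation period at all, which corresponds to the "lacks health benefits" scenario in the recommendation rather than to this rule.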
THEMIS WG3 is bringing this up at the THEMIS F2F Day 1 this week.
RECOMMENDATION It is not required that all subjects from the raw data be carried over to the CDM; in fact, removing people that are not of high enough quality may help researchers using the CDM. Example scenarios to remove subjects include: a person's year of birth or age is unreasonable (e.g. born in year 0, 1800, 2999 or just lacking a year of birth), the person lacks health benefits in a claims database (i.e. you do not have a complete picture of their record), or the raw data states that the person may not be of high research quality (e.g. CPRD will actually suggest which people not to use within research). Removal of a patient is not required and should be made in consideration of the raw data source. Reasons for removal of persons should be documented in the ETL documentation and the METADATA table (insert a row in METADATA where metadata.name='count of removed persons' and metadata.value_as_string='xyz', where xyz is a number, e.g. 12).
RULE An ETL should not delete persons who contribute time but have no health care utilization (e.g. an individual who is enrolled in insurance but does not visit a doctor or pharmacy). This individual will still contribute to the analysis as a healthy / non-care-seeking individual.