Eliminating Persons [THEMIS WG3 - TOPIC 2]

ericaVoss · June 25, 2018, 4:31pm

Any final feedback?

Sulev_Reisberg · November 12, 2018, 2:40pm

I’m working on bringing whole Estonian EHR documents (yes!) into the OMOP model, and the same issue strikes…

In short, in approximately 1% of cases the patient’s birth year is not recorded in our EHR, as recording it is not mandatory.
Unfortunately, birth year is mandatory in OMOP. What concerns me most is the recommendation “For data sources where the year of birth is not available, the approximate year of birth is derived based on any age group categorization available.” (https://github.com/OHDSI/CommonDataModel/wiki/PERSON)
I think this suggestion was made for data sources whose structure lacks birth years completely, but I don’t think making assumptions about the birth year is good practice anyway.

However, a missing birth year does not necessarily indicate poor data quality. For several research questions (e.g. estimating healthcare costs for a nation) it is not a necessary parameter. The same applies to gender. I’d prefer a missing birth year to an “estimated year of birth”.

Thus, from my point of view, data quality is always something that depends on the question we are trying to answer. Sometimes it is important to have the birth year information, but sometimes it is more important to have complete data (of the nation) with the costs, even without the exact birth years/genders.

I would be very concerned about dropping 1% of the data just because the birth year is a mandatory field. Any suggestions on how to solve this?

Sulev_Reisberg · November 12, 2018, 3:06pm

Just a clarification. There seem to be two different questions in this thread: “possibly incorrect value” and “missing value”. I agree that bringing over (clearly) incorrect values should be avoided. However, instead of dropping the records, I would suggest turning these values into “unknown” and thus allowing missing values, even for birth year.

ericaVoss · November 12, 2018, 6:50pm

Yeah, I think this initial statement is old and should maybe be updated or removed: “For data sources where the year of birth is not available, the approximate year of birth is derived based on any age group categorization available.”

Later we do say “Example scenarios to remove subjects include: a person’s year of birth or age are unreasonable (e.g. born in year 0, 1800, 2999 or just lacking a year of birth), person lacks health benefits in claims database (i.e. thus you do not have a complete picture of their record), or raw data states that the person may not be of high research quality (e.g. CPRD will actually suggest which people not to use within research). Removal of a patient is not required and should be made in consideration of the raw data source.”

However, with all of that said, you have to think about your use cases. For us, most analyses would include an age component, so I would probably just eliminate those folks.
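To make the elimination rule above concrete, here is a minimal sketch of how an ETL step might separate persons with a missing or unreasonable year of birth from the rest. It is only an illustration: the `SourcePerson` record, the function names, and the 1900 to 2018 plausibility bounds are all assumptions, not part of the OMOP CDM or of any OHDSI tool, and the bounds should be chosen with the raw data source in mind.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical plausibility bounds; choose values that fit the source data.
MIN_BIRTH_YEAR = 1900
MAX_BIRTH_YEAR = 2018


@dataclass
class SourcePerson:
    """Hypothetical source record; not an OMOP CDM structure."""
    person_id: int
    year_of_birth: Optional[int]


def elimination_reason(person: SourcePerson,
                       min_year: int = MIN_BIRTH_YEAR,
                       max_year: int = MAX_BIRTH_YEAR) -> Optional[str]:
    """Return why a person would be eliminated, or None to keep them."""
    if person.year_of_birth is None:
        return "missing year of birth"
    if not (min_year <= person.year_of_birth <= max_year):
        return "unreasonable year of birth"
    return None


def split_persons(persons: Iterable[SourcePerson]
                  ) -> Tuple[List[SourcePerson], Dict[int, str]]:
    """Split persons into those kept and those eliminated (with a reason)."""
    kept: List[SourcePerson] = []
    eliminated: Dict[int, str] = {}
    for person in persons:
        reason = elimination_reason(person)
        if reason is None:
            kept.append(person)
        else:
            eliminated[person.person_id] = reason
    return kept, eliminated


persons = [
    SourcePerson(1, 1975),
    SourcePerson(2, None),   # year of birth not recorded
    SourcePerson(3, 1800),
    SourcePerson(4, 2999),
    SourcePerson(5, 0),
]
kept, eliminated = split_persons(persons)
```

Recording a reason for each eliminated person, rather than silently dropping the record, makes it possible to report how many persons were removed and why.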

Christian_Reich · November 16, 2018, 7:47am

I am glad you are bringing this up, because it is often a source of concern and anxiety. The use cases we generally support are observational research on longitudinal patient records. In other words, what happens to a group of patients over time. These types of analytics almost always require the correct age; otherwise, interpretation of the results is nearly impossible. As a community, we decided to make at least the age in years mandatory, because having to exclude missing birth years in every query is an enormous extra burden.

Healthcare cost for the nation is what we call “cross-sectional” research. It doesn’t care about individual patients; it takes a snapshot of everything that happens at a point in time. Another example is the total amount of a drug used in the nation, etc. The OMOP CDM is not designed around supporting these. And in general, very few databases have nation-wide patient-level data or allow what’s called “projection” of records to the national level. Our data tend to sample some distinct population (hospital, insurance company, etc.).

This reflex I really don’t understand. We are not hoarders. We use the data to make inferences based on them. And if almost every analysis that is not cross-sectional will have to exclude these patients anyway, you don’t gain anything by holding on to them.

Christian_Reich · November 16, 2018, 7:50am

BTW, that is awesome, @Sulev_Reisberg.