OK, I’m asking for trouble @Christian_Reich.
I think there is quite a bit of value in keeping the data needed to construct the following somewhere:
(A) the frequency of clearly incorrect data in source X for values of Y and, if you make any substitutions or deletions of Y in the ETL,
(B) the fact that some other value has been substituted for it and
(C) what value was substituted for the original value
A is essential to inform the level of trust we should have in source X for values of Y. Everyone should have less confidence in using X as a source for Y if a third of the values are bad than if 0.00001% are bad, and we need a way to capture and use that information. B and C are needed for sensitivity analyses that assess the impact of alternative choices of substituted values in analyses that include Y from X.
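To make this concrete, here is a minimal sketch of what keeping A–C somewhere could look like, assuming a hypothetical substitution record kept alongside the ETL. The names (ValueSubstitution, error_rate) are illustrative only and not part of the OMOP CDM or any existing convention:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record of a single ETL substitution/deletion (B and C);
# field names are illustrative, not an OMOP CDM or Themis standard.
@dataclass
class ValueSubstitution:
    source_table: str                  # source X
    source_field: str                  # value Y
    original_value: str                # the clearly incorrect source value
    substituted_value: Optional[str]   # None means the value was deleted

def error_rate(subs: list[ValueSubstitution], total_rows: int,
               source_table: str, source_field: str) -> float:
    """A: frequency of clearly incorrect data in source X for values of Y."""
    bad = sum(1 for s in subs
              if s.source_table == source_table and s.source_field == source_field)
    return bad / total_rows if total_rows else 0.0
```

With something like this, the error rate in A falls out of the same records that support the sensitivity analyses in B and C, rather than living only in ETL documentation.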
I agree that we can’t simply keep values that will introduce error into studies (implausible values or "9999"s) without some reliable method for removing them from analyses. But I strongly disagree that we should simply delete implausible values, if that’s what you are suggesting. We need to keep a standard, machine-readable representation of that fact in a standard location. Otherwise, we are ignoring important information that can and should be used to assess data quality and its contribution to measurement error; and if it’s only noted in a document somewhere, it’s not in a form that can be used efficiently in analyses.
A convention like “9999”, if used consistently (and that consistent use could be a Themis check), can identify these values and leave them out; a minimal sketch of such a check follows the list below. I don’t have a strong opinion on which strategy is used to keep known bad values out of analyses. In the end it only matters that:
- Some standard is followed that prevents incorrect data from being used in analyses,
- Important data quality information is not buried in the ETL process, and
- Value transformations from the source are captured and represented in a way that allows standardized sensitivity analyses.
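As for the “9999” convention, here is a minimal sketch of how a consistently applied sentinel could be recognized and excluded before analysis. The sentinel set and function name are assumptions for illustration, not an existing Themis rule:

```python
# Assumed sentinel convention for 'known bad' values; purely illustrative.
SENTINEL_VALUES = {"9999"}

def exclude_known_bad(records: list[dict], field: str) -> list[dict]:
    """Drop records whose `field` holds the agreed sentinel so analyses never
    see known bad values; a consistency check could verify the convention is
    applied uniformly across sources."""
    return [r for r in records if str(r.get(field)) not in SENTINEL_VALUES]
```

Whatever the agreed convention ends up being, the point is that it has to be applied (and checkable) uniformly, so the exclusion can be automated rather than rediscovered in each study.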