
Best approach to handle errors in source data

Hello,

We are in the process of transforming admissions data from various state hospitals into the OMOP CDM. However, we have come across a few records in the source data where the admission dates fall after the discharge dates and are inconsistent with the provided age (in months). From past experience, how have other implementers dealt with such seemingly erroneous source data? Is it better to disregard these records altogether, or should we be correcting the offending records before loading them into the corresponding OMOP tables?
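
To illustrate the issue, here is a rough pandas sketch of the kind of check that flags these records; the column names (admission_date, discharge_date, birth_date, age_in_months) are placeholders rather than our actual source schema:

```python
import pandas as pd

# Placeholder column names -- substitute whatever the hospital extracts actually use.
admissions = pd.read_csv(
    "admissions.csv",
    parse_dates=["admission_date", "discharge_date", "birth_date"],
)

# An admission that starts after its own discharge is implausible.
bad_date_order = admissions["admission_date"] > admissions["discharge_date"]

# The age in months implied by the admission date should roughly match the
# age_in_months value supplied in the source (1-month tolerance used here).
implied_age_months = (
    (admissions["admission_date"].dt.year - admissions["birth_date"].dt.year) * 12
    + (admissions["admission_date"].dt.month - admissions["birth_date"].dt.month)
)
bad_age = (implied_age_months - admissions["age_in_months"]).abs() > 1

suspect = admissions[bad_date_order | bad_age]
print(f"{len(suspect)} suspect admission records out of {len(admissions)}")
```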

Regards,
Vimala

@Vimala_Jacob:

If you don’t know how to fix those records, kick them out. We cannot burden the analytics tools with somehow figuring this out.

Thank you for your opinion on the matter. Does anyone else have experience with data cleansing methods that they would be willing to share? Just some basic guidelines that could be used as best practice would be helpful.

I would not blame a data vendor for deleting data that indicates implausible events.

However, I think a better approach than deletion (for an academic center, perhaps) is to assign the patient to a ‘pts with errors’ cohort.

And during analysis, we could choose to exclude patients in this error cohort.

That way we can show the source data people where the problems are. If the data gets deleted, we don’t have a chance to improve it in the long run.

It kind of defeats the purpose (a little bit) of having a drill-down feature in Achilles Heel, or of having Heel at all. It is like “photoshopping your data”.
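
To make the idea concrete, here is a minimal pandas sketch of flagging rather than deleting; the table and column names and the cohort id are invented for illustration, and a real ETL would write the cohort to the CDM COHORT table following its own conventions:

```python
import pandas as pd

# Invented names for illustration only; map these onto your own ETL conventions.
ERROR_COHORT_ID = 9999  # hypothetical id for the ‘pts with errors’ cohort

admissions = pd.read_csv(
    "admissions.csv", parse_dates=["admission_date", "discharge_date"]
)

# Flag implausible records instead of deleting them.
admissions["has_error"] = admissions["admission_date"] > admissions["discharge_date"]

# One row per affected patient, tagged with the error-cohort id.
error_cohort = (
    admissions.loc[admissions["has_error"], ["person_id"]]
    .drop_duplicates()
    .assign(cohort_definition_id=ERROR_COHORT_ID)
)

# At analysis time, exclude these patients rather than losing the rows at ETL time.
clean_analysis_set = admissions[
    ~admissions["person_id"].isin(error_cohort["person_id"])
]
```

That way the implausible rows stay visible in the database (and in Achilles Heel), but routine analyses can simply leave the error cohort out.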


Hello Vojtech,

Thanks for your suggestion, which I believe is a good approach as it categorises all of the source data, erroneous or otherwise, and would help improve data quality in the future.

Regards,
Vimala
