
Best approach to handle errors in source data

Hello,

We are in the process of transforming admissions data from various state hospitals into the OMOP CDM. However, we have come across a few records in the source data where the admission dates fall after the discharge dates and are inconsistent with the provided age (in months). From past experience, how have other implementers dealt with such seemingly erroneous source data? Is it better to disregard these records altogether, or should we be correcting the offending records before loading them into the corresponding OMOP tables?
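
To illustrate the issue, here is a rough pandas sketch of the kind of check that flags these records; the column names (admission_date, discharge_date, birth_date, age_in_months) are placeholders rather than our actual source schema:

```python
import pandas as pd

# Placeholder column names -- substitute whatever the hospital extracts actually use.
admissions = pd.read_csv(
    "admissions.csv",
    parse_dates=["admission_date", "discharge_date", "birth_date"],
)

# An admission that starts after its own discharge is implausible.
bad_date_order = admissions["admission_date"] > admissions["discharge_date"]

# The age in months implied by the admission date should roughly match the
# age_in_months value supplied in the source (1-month tolerance used here).
implied_age_months = (
    (admissions["admission_date"].dt.year - admissions["birth_date"].dt.year) * 12
    + (admissions["admission_date"].dt.month - admissions["birth_date"].dt.month)
)
bad_age = (implied_age_months - admissions["age_in_months"]).abs() > 1

suspect = admissions[bad_date_order | bad_age]
print(f"{len(suspect)} suspect admission records out of {len(admissions)}")
```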

Regards,
Vimala

@Vimala_Jacob:

If you don’t know how to fix those records, kick them out. We cannot burden the analytics tools with somehow figuring this out.

Thank you for your opinion on the matter. Does anyone else have experience with data cleansing methods that they would be willing to share? Just some basic guidelines that could be used as best practice would be helpful.

I would not blame a data vendor for deleting data that indicates implausible events.

However, I think a better approach than deletion (for an academic center, perhaps) is to assign the patient to a ‘pts with errors’ cohort.

And during analysis, we could choose to exclude patients in this error cohort.

That way we can show the source data people where the problems are. If the data gets deleted, we don’t have a chance to improve it in the long run.

It kind of defeats the purpose (a little bit) of having a drill-down feature in Achilles Heel, or of having Heel at all. It is like “photoshopping your data”.
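
To make the idea concrete, here is a minimal pandas sketch of flagging rather than deleting; the table and column names and the cohort id are invented for illustration, and a real ETL would write the cohort to the CDM COHORT table following its own conventions:

```python
import pandas as pd

# Invented names for illustration only; map these onto your own ETL conventions.
ERROR_COHORT_ID = 9999  # hypothetical id for the ‘pts with errors’ cohort

admissions = pd.read_csv(
    "admissions.csv", parse_dates=["admission_date", "discharge_date"]
)

# Flag implausible records instead of deleting them.
admissions["has_error"] = admissions["admission_date"] > admissions["discharge_date"]

# One row per affected patient, tagged with the error-cohort id.
error_cohort = (
    admissions.loc[admissions["has_error"], ["person_id"]]
    .drop_duplicates()
    .assign(cohort_definition_id=ERROR_COHORT_ID)
)

# At analysis time, exclude these patients rather than losing the rows at ETL time.
clean_analysis_set = admissions[
    ~admissions["person_id"].isin(error_cohort["person_id"])
]
```

That way the implausible rows stay visible in the database (and in Achilles Heel), but routine analyses can simply leave the error cohort out.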


Hello Vojtech,

Thanks for your suggestion, which I believe is a good approach as it categorises all of the source data, erroneous or otherwise, and would help improve data quality in the future.

Regards,
Vimala
