
Data Quality Checking

So we have data that I’m working with a vendor to map to OMOP in a Semantic and Graph database. We have exported the data and have an instance of Atlas running, where the data will then be imported. This will provide a proof of concept that we can provide the data to others and they can read it.

What I need to do is check whether the data imported correctly and integrity was maintained. We want to QC/QA the data in Atlas. I can’t seem to find ways to do this, though I admit to knowing only the bare minimum about Atlas. I’m seeking guidance on this process, perhaps a validation process of some sort.

There is active work on building a QC/QA specification and tool called the Data Quality Dashboard, and there is progress on SQL templates to check these constraints. The objective is to have something that can be used by the 2019 OHDSI Symposium on Sept 16th.

So it seems this is a job best done by folks with programming knowledge. With regard to the graphs and charts, it would seem some visual checking against the underlying data is needed to confirm they are displaying the data accurately / pulling from the right source. I’m trying to put together a UAT process for checking the data imported into ATLAS. Are there logs or error files for the importing of files?

In the current version of Atlas, I don’t think there is much in terms of DQ.

You would need to open RStudio and run a few lines of R code using the Achilles R package.
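A minimal sketch of what those few lines might look like, assuming the standard OHDSI DatabaseConnector and Achilles packages (connection values and schema names are placeholders; check your installed Achilles version for exact parameter names):

```r
# Sketch only: connection values are placeholders; adjust the dbms, server,
# and schema names to your environment.
library(DatabaseConnector)
library(Achilles)

connectionDetails <- createConnectionDetails(
  dbms     = "postgresql",            # e.g. "sql server", "redshift", ...
  server   = "myserver/mydatabase",   # placeholder
  user     = "user",
  password = "secret"
)

# Run the Achilles characterization; results land in the results schema
# that Atlas reads from.
achilles(
  connectionDetails,
  cdmDatabaseSchema     = "cdm",      # schema holding the OMOP CDM tables
  resultsDatabaseSchema = "results",  # schema where Achilles writes its results
  sourceName            = "My CDM",
  cdmVersion            = "5.3"
)
```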

So at this point, how do we know that the SynPUF data weren’t farkled when imported into Atlas?

You don’t, @jliddil1, and it won’t matter. SynPUF are synthetic data. You must not use them to draw any insights or research study results. They’re entirely for training, testing and demoing.

Right. So if we bring our data into Atlas, we don’t know if it is farkled. Coming from a Pharma background, I’m all about compliance, 21 CFR Part 11, etc. Would the import process pass a validation audit?

If you run Achilles, you can see some basic quality things in the Heel reports (patients with data outside their observation period, data before birth, after death, orphaned data points and the like). You can see funky stuff on the Achilles plots if there was some issue in the ETL for sure, but this is very empirical and depends on how well you know your own data. Any other DQ procedures should be part of your ETL process: checks designed as unit tests, verifying counts of certain drugs/conditions/etc., should be created on site. In other words, your ETL process passing an audit is 100% dependent on how well you crafted it :slight_smile:
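For illustration, one of those Heel-style rules (records dated before the person’s birth year) can be expressed directly against the CDM. This is only a sketch in the DatabaseConnector style, not the actual Heel SQL; the `cdm` schema name is a placeholder:

```r
# Sketch of a Heel-style check: condition records dated before the person's birth year.
# Table and column names follow the OMOP CDM; "cdm" is a placeholder schema name.
library(DatabaseConnector)

conn <- connect(connectionDetails)   # connectionDetails as in the Achilles example

sql <- "
SELECT COUNT(*) AS n_before_birth
FROM @cdm.condition_occurrence co
JOIN @cdm.person p
  ON p.person_id = co.person_id
WHERE YEAR(co.condition_start_date) < p.year_of_birth;
"
renderTranslateQuerySql(conn, sql, cdm = "cdm")

disconnect(conn)
```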

@jliddil1 maybe I am misunderstanding your process, but typically ATLAS doesn’t house any data. Rather, it points to an instance of an OMOP CDM sitting somewhere else. Therefore anything coming out of ATLAS will match anything run on the CDM itself, as they are referencing the same database. The best way to double-check this would be to create a cohort in ATLAS and then run the SQL ATLAS generates on the source CDM.
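A sketch of that double-check, assuming you have exported the cohort SQL from ATLAS to a local file (the path, schema names and cohort id are placeholders, and the template parameter names can vary a bit between ATLAS versions):

```r
# Sketch: run ATLAS-exported cohort SQL directly against the CDM and compare the
# resulting person count with the number ATLAS reports for the same cohort.
library(DatabaseConnector)
library(SqlRender)

cohortSql <- readSql("atlas_cohort_123.sql")   # placeholder path to the exported SQL

# Fill in the template parameters ATLAS leaves in the export, then translate
# to your database dialect.
sql <- render(cohortSql,
              cdm_database_schema    = "cdm",
              target_database_schema = "results",
              target_cohort_table    = "cohort",
              target_cohort_id       = 123)
sql <- translate(sql, targetDialect = "postgresql")  # match your CDM's dialect

conn <- connect(connectionDetails)
executeSql(conn, sql)
querySql(conn,
         "SELECT COUNT(DISTINCT subject_id) AS persons
          FROM results.cohort
          WHERE cohort_definition_id = 123;")
disconnect(conn)
```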

If your question is more around quality checking the conversion, there are a few ways to do that as outlined by @Juan_Banda, @Vojtech_Huser and @cce. We are currently working on a data quality dashboard that runs a set of checks on your CDM based on the CDM specifications and vocabulary mapping expectations. Once that work is complete, it will create a report outlining the checks that were run, the number and percent of rows that failed, and the SQL code to retrieve the offending rows.

Clair


Will probably repeat multiple things that people have already said.

When our team is testing the results of our OMOP CDM ETL (or validating someone’s OMOP CDM), we do it in multiple steps:

  • Test the data in the OMOP CDM from an integrity perspective, e.g. referential integrity, key uniqueness, constraints, allowed value ranges, etc. Some databases offer DB constraints that should definitely be enabled after the data are loaded.
  • Check mapping rates, e.g. the number and percentage of records successfully mapped to standard concepts (a sketch follows below).
  • Check compliance with THEMIS business rules.
  • Look for any anomalies in ACHILLES reports, for example data shifts or spikes, absence of data, etc.

(With the DQ Dashboard that our OHDSI team is working on, we will be able to standardize and automate several of the things I listed above.)
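A sketch of the mapping-rate check mentioned in the list above, using condition_occurrence as the example (the `cdm` schema name is a placeholder; the same pattern applies to drug_exposure, procedure_occurrence, etc.):

```r
# Sketch: percentage of condition records that did NOT map to a standard concept.
# In the OMOP CDM, a *_concept_id of 0 means the source value is unmapped.
library(DatabaseConnector)

conn <- connect(connectionDetails)

sql <- "
SELECT COUNT(*) AS total_records,
       SUM(CASE WHEN condition_concept_id = 0 THEN 1 ELSE 0 END) AS unmapped_records,
       100.0 * SUM(CASE WHEN condition_concept_id = 0 THEN 1 ELSE 0 END) / COUNT(*) AS pct_unmapped
FROM @cdm.condition_occurrence;
"
renderTranslateQuerySql(conn, sql, cdm = "cdm")

disconnect(conn)
```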

Then, as a part of UAT, we often take some of the existing queries and analyses that were successfully completed on the source (“raw”) data, replicate those on the OMOP CDM data, and validate the results. It will not necessarily produce the same results (and most likely will not), but it gives a good idea of what changed during the ETL process. You can use ATLAS for designing these cohorts or analyses, but you do not have to. So, if you already have some existing queries against your Semantic database, you could convert them into the OMOP CDM dialect and run them to compare the output in both cases.
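For example, if one of your existing source queries counts patients with a given diagnosis, the CDM-side replica is the same count keyed on the mapped standard concept. A sketch (concept id 201826, type 2 diabetes mellitus, is used purely as an illustration; substitute whatever your source query targets):

```r
# Sketch: replicate a simple source-side patient count on the CDM side.
library(DatabaseConnector)

conn <- connect(connectionDetails)

cdmCount <- renderTranslateQuerySql(conn, "
SELECT COUNT(DISTINCT person_id) AS persons
FROM @cdm.condition_occurrence
WHERE condition_concept_id = 201826;  -- illustrative standard concept id
", cdm = "cdm")

disconnect(conn)

# Compare cdmCount with the result of the equivalent query on the source
# (e.g. the semantic/graph database) and investigate any large discrepancy.
print(cdmCount)
```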

No such thing for observational research. You can do a thorough job for your software implementations, you can even do some software system validation. But this has nothing to do with any predicate rule affecting the process of an NDA, or any GxP regulations. Let’s not confuse these things. Part 11 in particular is about electronic records and electronic signatures in lieu of their paper equivalents. We don’t have that problem. Or better, we are on the other end of the spectrum: our data are by definition not controlled, and the original paper records that once existed were not collected for the purpose of any activity regulated by the FDA.

Right. But now we are in the world of real-world evidence and real-time data. The FDA and patient advocates are asking for these data. We have Pharma asking to mine these data and use them for indication approvals. So maybe we aren’t there yet, but I can tell you it is coming down the road fast. I’ve been through the “it’s not for registrational purposes” scenario before, only to have the data be used for Phase 4 approvals while we scrambled to clean it, etc.

I totally agree. And we have to figure out a way to support the quality of our evidence generation, without the traditional logic “only perfect data will guarantee perfect results”. It will not happen, no matter how much we jump up and down. But the regulatory community isn’t there yet, they still follow their reflexes. And we haven’t done our homework, yet.
