Data Quality - Share your experience with Achilles Heel tool here

Vojtech_Huser · April 7, 2015, 4:45pm

I am writing a paper about Achilles Heel and I would like to hear from sites that discovered errors in their ETL or in source data using Achilles Heel.

For example, Achilles Heel (part of Achilles) can alert you that there are events for patients prior to birth or after death.
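
To make the idea concrete, here is a minimal Python sketch of that kind of check. It is not the SQL that Achilles Heel actually runs; the record layout and the function name `find_out_of_lifespan_events` are assumptions invented for this illustration.

```python
from datetime import date
from typing import NamedTuple, Optional


class Person(NamedTuple):
    person_id: int
    birth_date: date
    death_date: Optional[date]  # None if the patient is not known to have died


class Event(NamedTuple):
    person_id: int
    event_date: date


def find_out_of_lifespan_events(persons, events):
    """Return (event, reason) pairs for events before birth or after death."""
    by_id = {p.person_id: p for p in persons}
    flagged = []
    for ev in events:
        person = by_id.get(ev.person_id)
        if person is None:
            continue  # events without a person are a separate quality problem
        if ev.event_date < person.birth_date:
            flagged.append((ev, "before birth"))
        elif person.death_date is not None and ev.event_date > person.death_date:
            flagged.append((ev, "after death"))
    return flagged


# Toy data, invented for the example.
persons = [
    Person(1, date(1950, 3, 1), date(2010, 6, 30)),
    Person(2, date(1980, 1, 15), None),
]
events = [
    Event(1, date(1949, 12, 31)),  # dated before birth
    Event(1, date(2011, 1, 1)),    # dated after death
    Event(2, date(2000, 5, 5)),    # plausible
]

flagged = find_out_of_lifespan_events(persons, events)
for ev, reason in flagged:
    print(ev.person_id, ev.event_date.isoformat(), reason)
```

A real implementation would probably need some tolerance (for example, records entered shortly after a death date), which this sketch leaves out.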

A complete list of issues addressed by Achilles Heel is available in this earlier post: Achilles Heel initial discussion.

For IMEDS Lab users, reports for CCAE, GE and all other datasets in IMEDS are available via a link in the Cloud Lab (email me if you want access; for security reasons the link is not posted here). They are similar to the public report here: http://www.ohdsi.org/web/achilles/#/SAMPLE/achillesheel

I would also like to hear from people who were not able to install Achilles, and why. For example, here at NIH, I don’t have a non-Active Directory login to our database, and I was not able to make Achilles work at all (I think it requires some login to the database). But we had a separate, non-Heel data quality effort that I used on NIH data (e.g., flagging patients living more than 130 years).

mark_velez · April 7, 2015, 6:20pm

Hello Vojtech. At Columbia we certainly have found errors both in the source data and in the ETL using Achilles Heel. For example, we found that our ETL resulted in many records being outside of observation periods. Patients born far too long ago or even in the future due to source data issues were also revealed to us in the Heel report.

Several issues were detected or better understood through other Achilles reports. For example, the aforementioned ancient and future-born patients were more quickly noticed in the Person report’s histogram. The Drug Era report helped us discover a bug in the ETL when the prevalence of one drug era shown in the tree map did not match what we knew to be true.

Hope that helps. Happy to answer any questions.

rkboyce · April 7, 2015, 8:28pm

For doing an ETL of Nursing Home (NH) data, including the NH Minimum Data Set, the tool helped us identify the following:

Our ETL had originally truncated observation periods for patients with data near the start and end dates of the data pull. The cumulative data graph in the dashboard made this issue obvious.
Several cases where our load process had incorrectly inferred discharge dates prior to admission (resulting in bad observation periods). Similar issues were identified with drug eras, some related to the source data and some to the ETL process.
An error in our ETL that was causing data from the assessment of cognitive skills for daily decision making to be ignored during translation, which left many empty observation entries.
We noticed that a number of observations were triggering the “Number of observation records with no value” error. Upon investigation, we learned that we could address this issue if we requested that LOINC answer codes be included in the next vocabulary update. They have been, and we are currently filling in the holes.
Along the same lines, the Achilles reports and analytics motivated us to do further work toward a complete ETL to the standard vocabulary because of the benefits to data quality assurance.
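
The date problems in the list above (an end date inferred before the start date, and records that then fall outside any usable observation period) can be sketched in a few lines of Python. This is only an illustration under assumed data shapes: periods are `(person_id, start, end)` tuples, events are `(person_id, date)` tuples, and both function names are made up for this example.

```python
from datetime import date


def invalid_periods(periods):
    """Return periods whose end date precedes their start date."""
    return [p for p in periods if p[2] < p[1]]


def events_outside_periods(events, periods):
    """Return events that fall in none of the patient's valid periods."""
    valid = {}
    for person_id, start, end in periods:
        if start <= end:  # ignore periods that are themselves broken
            valid.setdefault(person_id, []).append((start, end))
    return [
        (person_id, when)
        for person_id, when in events
        if not any(start <= when <= end for start, end in valid.get(person_id, []))
    ]


# Toy data, invented for the example.
periods = [
    (1, date(2012, 1, 1), date(2012, 12, 31)),
    (2, date(2013, 5, 10), date(2013, 5, 1)),  # end before start
]
events = [
    (1, date(2012, 6, 1)),  # inside person 1's period
    (1, date(2013, 2, 1)),  # after person 1's period
    (2, date(2013, 5, 5)),  # person 2 has no valid period at all
]

bad_periods = invalid_periods(periods)
orphans = events_outside_periods(events, periods)
print(len(bad_periods), "invalid period(s);", len(orphans), "event(s) outside any valid period")
```

Running both checks together is useful because one broken period can make every record for that patient look like it is outside observation.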

Vojtech_Huser · April 29, 2015, 3:55pm

Thank you to all who responded so far.
I would like to include even more sites, so please reply to this thread or email me.

For the paper, we decided to compare Heel output data (CSV file produced by Achilles). We have 5 participating sites and a total of 16 datasets analyzed at this point.

If you are interested in also contributing your site’s Heel data (no person-level data), please let me know. (For example, I would love to have Regenstrief data, @jon_duke … )

For example, the most common ruleIDs across datasets are:

717 Distribution of quantity by drug_concept_id; max
600 Proc: Concepts in data are not in correct vocabulary (CPT4/HCPCS/ICD9P)
101 Person: Number of persons by age, with age at first observation period; should not have age < 0
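
For readers wondering how such a cross-dataset ranking can be produced, here is a minimal Python sketch that counts in how many datasets each rule ID appears. The dataset names and the per-dataset lists are invented for the example (they are not the study's data), and `count_datasets_per_rule` is a hypothetical helper, not part of Achilles.

```python
from collections import Counter

# Invented example: rule IDs reported by Heel for three hypothetical datasets.
heel_rules_by_dataset = {
    "dataset_a": [717, 600, 101, 717],  # a rule can fire more than once
    "dataset_b": [717, 600],
    "dataset_c": [717],
}


def count_datasets_per_rule(rules_by_dataset):
    """Count, for each rule ID, the number of datasets in which it fired."""
    counts = Counter()
    for rule_ids in rules_by_dataset.values():
        counts.update(set(rule_ids))  # set(): count each rule once per dataset
    return counts


rule_counts = count_datasets_per_rule(heel_rules_by_dataset)
for rule_id, n_datasets in rule_counts.most_common():
    print(rule_id, n_datasets)
```

Counting datasets rather than raw rows keeps one very large dataset from dominating the ranking.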

Patrick_Ryan · May 2, 2015, 11:14pm

A related topic: Congrats to @mgkahn, @schillil, @Andrew, @Daniella_Meeker on their recent publication in eGEMS on data quality reporting: http://repository.academyhealth.org/cgi/viewcontent.cgi?article=1052&context=egems. This could be useful context for @Vojtech_Huser 's effort to provide a community summary of data quality issues. Given @toanong and @callahantiff 's related work in this area, it seems there’d be a lot to contribute around how the observed data quality issues from ACHILLES HEEL fit into reporting and recommended best practices for any organizations with patient-level observational data.

writetoritu · June 2, 2015, 3:40pm

Hi @Vojtech_Huser, are you still accepting input from other sites? We would like to share our experience of using Achilles Heel in the PEDSnet project. Please let me know if you are still open to contributions and I can share some feedback and summarized results.

Thanks!
Ritu

Vojtech_Huser · June 9, 2015, 6:05pm

Yes. New sites can still join the evaluation of Achilles Heel and work on the resulting paper. Please send a regular email to my work address to coordinate.

To also provide an update on the work:
We continue to work on the paper. Sunny Shang also piloted some classification of the quality rules.

A GitHub version of Achilles (after my fork and merge by Chris) now distinguishes a Heel sub-analysis on top of an Achilles analysis (e.g., 113-1 is the rule for checking pre-birth events and 113-2 is the rule for post-mortem events).

The dilemma is which CDM version to focus on, v5 or v4 (probably v5, porting the changes back from v5 to v4 at certain time points).

We did a mini email survey of sites on their use of Achilles Heel (80% response rate).

Existing included sites are hereby encouraged to check the manuscript draft (in the cloud) and post comments.

Vojtech_Huser · February 1, 2016, 3:23pm

I wanted to share an update on Achilles Heel.
After the Data Quality Code-a-thon, Chris Knoll and I made a few changes to Achilles Heel.

Each quality rule now outputs some quantitative data into separate (newly created) columns, so that the extent of the problem can be quantified (e.g., the number of rows that violate the rule).
I created an overview of the rules and introduced a rule_id. See the table here: https://github.com/OHDSI/Achilles/blob/master/inst/csv/achilles_rule.csv

Both v4 and v5 versions of Heel exist, but v5 is where we may focus most in future extensions.

If you recently ran Achilles and Achilles Heel, please post your experience here (even if you just ran it and saw no errors).

Vojtech_Huser · February 18, 2016, 2:36pm

I want to post a few updates about Achilles Heel (and also, indirectly, IMEDS).

IMEDS now has one dataset in CDM v5. I was able to test v5 versions of remade Heel data quality checks.

A sample output is shown below (table achilles_heel_results). Columns 3 and 4 are the new additions to Achilles (which I am provisionally calling version 1.2).

To see the output without updating your Achilles web application, you can use the new R command fetchAchillesHeelResults.

Vojtech_Huser · August 3, 2016, 12:42pm

For Heel users, the latest Achilles version now has improved overviews of the rules and derived analyses. (There have been many questions about the rules on the forum lately.)

See CSV files here:

Two of the CSV files are also made into an HTML overview in the Extras folder.

Rule Drill Down is a new planned feature that will show which data rows trigger a given DQ rule.

Vojtech_Huser · September 20, 2016, 2:59pm

Two updates:
1. I am pleased to announce that the Heel Evaluation Study has now been fully accepted. (Proofs are being generated, so a link to the PDF is coming soon.)

The citation would be:

Huser V, DeFalco F, Schuemie M, Ryan P, Shang N, Velez M, Park R, Boyce R, Duke J, Khare R et al: Multi-site Evaluation of a Data Quality Tool for Patient-Level Clinical Datasets. eGEMs 2016.

Also, I am organizing a new study, run in 2-3 iterations, to set thresholds for some of the new rules in Achilles 1.3.

Please email me (vojtech.huser at nih dot gov) if you have Achilles running and would be willing to run 4 lines of R code.
See the study here:
http://www.ohdsi.org/web/wiki/doku.php?id=research:dqstudy
and here

Andrew · September 28, 2016, 9:15pm

@Vojtech_Huser I get this error when installing any Achilles-required packages from GitHub:

 Error in curl::curl_fetch_disk(url, x$path, handle = handle) : 
  Peer certificate cannot be authenticated with given CA certificates

Andrew · September 29, 2016, 2:25pm

I found the solution to the above CA certificates problem. Posting here in case others run into the same cURL issue.

> library(httr)
> # Turn off SSL peer certificate verification for httr requests in this session
> set_config(config(ssl_verifypeer = 0L))

Running these first does the trick.

Vojtech_Huser · November 30, 2016, 4:23pm

The Heel-Evaluation-Study manuscript has finished the proof stage (after several weeks) and is now officially published. The link to the journal site is:

http://repository.edm-forum.org/egems/vol4/iss1/24/

Vojtech_Huser · June 20, 2017, 2:37pm

I am pleased to say that some outputs of the Data Quality study (see research:dqstudy [Observational Health Data Sciences and Informatics]) were incorporated in the latest update to Achilles (version 1.4.6).

The changes are in the latest pull request and also pasted here.

The Achilles readme file was updated with some description of the rules and pre-computations (see it here: GitHub - OHDSI/Achilles: Automated Characterization of Health Information at Large-scale Longitudinal Evidence Systems (ACHILLES) - descriptive statistics about a OMOP CDM database).

Of note, there are a total of 8 SQL dialects in SqlRender.
See here: SqlRender/vignettes/UsingSqlRender.Rmd at main · OHDSI/SqlRender · GitHub

I would like to work with the developers to also possibly test the Heel component on Impala, BigQuery and Redshift. If you have one of these environments, please report issues with executing Heel (here: https://github.com/OHDSI/Achilles/blob/master/inst/sql/sql_server/AchillesHeel_v5.sql) to this thread so that the Heel code can be tweaked to work on all 8 dialects.