As most of us know, data quality has become increasingly important as a factor that regulatory agencies use to determine whether a database can be considered ‘fit-for-use’ when it comes to informing decisions. To that end, I have volunteered to lead a new effort around creating an OHDSI data quality dashboard. The goal is to agree upon a set of data quality checks we would like to run against an OMOP CDM instance, on top of which can sit a dashboard of some type. I have an initial design that lays out some checks and a potential UI, all working within the Kahn framework, but there is still work to be done before implementation can begin (a rough sketch of one candidate check appears after the questions below). The remaining questions are:
Are the checks we have listed enough for a version 1 of the dashboard/tool?
Will there be consideration for trends over time?
Will there be a process to add new rules?
Will there be a data quality check for source mappings?
How will benchmark values be decided on?
How will the tool provide a way to drill down to the individual flagged rows?
How do we handle checks that are always red or always green?
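To make that discussion a bit more concrete, here is a rough sketch of how one candidate check might be expressed and run against an OMOP CDM instance. The SQL, the schema name, and the pass/fail threshold are purely illustrative and are not part of any agreed-upon design.

```r
# Illustrative only: a plausibility-style check in the spirit of the Kahn framework,
# counting PERSON records whose year_of_birth lies in the future.
library(DatabaseConnector)

connectionDetails <- createConnectionDetails(
  dbms     = "postgresql",
  server   = "localhost/cdm",   # placeholder connection details
  user     = "user",
  password = "password"
)

checkSql <- "
  SELECT COUNT(*) AS num_violated_rows
  FROM cdm.person               -- 'cdm' schema name is a placeholder
  WHERE year_of_birth > EXTRACT(YEAR FROM CURRENT_DATE)
"

connection <- connect(connectionDetails)
result <- querySql(connection, checkSql)
disconnect(connection)

# A hypothetical benchmark: flag the check if any rows violate it
numViolated <- result[[1]][1]
status <- if (numViolated > 0) "FAIL" else "PASS"
```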
Many in the community have already offered their expertise: @Andrew, Ajit Londhe, @davidcarnahan, @rtmill, @Vojtech_Huser, Mui Van Zandt, @Rijnbeek, Maxim Moinat, Mark Khayter, @DTorok, @cukarthik, Frank DeFalco, Christian Reich, @mgkahn, Clark Evans, Greg Klebanov, @Patrick_Ryan, @SCYou, and Tim Berquist. Please let me know if you would like to join - the plan is to meet next week to go over the current design and answer some of the questions above. Here is a link to a doodle poll to fill out: https://doodle.com/poll/evytwqhh7r3fw9cq. Once I figure out a good time for everyone I will post meeting information to this thread.
I am really excited about this effort and I am looking forward to everyone’s ideas!
Clair
Apparently I can only mention 10 users in a post, which is why not everyone is tagged.
Thank you to all who participated in the doodle poll. The time that works best for everyone is tomorrow, June 19, 2019 at 10am eastern. Here is the meeting link:
Clair, thank you for hosting this meeting. Given the urgency and priority of this initiative, I’m wondering if it’d be appropriate for it to have its own “Data Quality” category here in our forums? Second, I suggested that a quality control effort may want to have regression tests as the first development stage; this connects the work with the urgent need for a demo database, which @schuemie has started, so that we could validate that the checks are producing the sort of results we expect. Third, I suggested that rather than start with implementing, we could start by producing the expected output of the tool, in the expected output format. This way we could have a more concrete discussion of what the scope of the project is, and those who like to work on user interfaces would have something to target now, rather than waiting for a code drop. Having a tight, community-oriented feedback loop with regression tests and sample outputs is an important step toward a successful delivery, and it lets us get user feedback working before we even start to write code.
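As a strawman for that discussion, the expected output of a single check might look something like the record below. Every field name and value here is hypothetical; the point is only to give the UI and regression-test work something concrete to react to.

```r
# Hypothetical shape of one row of DQD output; none of these field names are final.
sampleResult <- data.frame(
  checkName        = "plausibleYearOfBirth",
  cdmTable         = "PERSON",
  kahnCategory     = "Plausibility",
  numViolatedRows  = 0,
  threshold        = 0,
  status           = "PASS",
  stringsAsFactors = FALSE
)

# A regression test against the demo database could then assert against such a record,
# e.g. with the testthat package:
# testthat::expect_equal(observedResult$status, sampleResult$status)
```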
We had a productive discussion on Wednesday and I appreciate everyone who could join and give feedback. We are working within a narrow scope of assessing data quality at the CDM specification level for v1. Below are the links to the documents we are using to describe the design of the dashboard from a ground-up perspective, as well as the proposed quality checks we would like to implement for phase 1. Please take a look and leave comments if the checks do not make sense or if there are any that seem out of scope. Additionally, if there are any you see that could fit into the empty Kahn categories, feel free to leave a comment about that as well and we will discuss it at our next meeting (TBD).
For anyone interested in being a part of the developers group:
I’ve created a Doodle poll to help us select a date and time for the Developers Kickoff call: https://doodle.com/poll/efcfv34fws27bav7. Please note that this is not mandatory for everyone, but just for those interested in the software development side of this project. Vote for the dates/times in which you can attend.
Hi Clair, thanks for your efforts here. Have there been any further thoughts on which CDM versions will be supported by the tool? CDM versions 6.0 and 5.3.1 were mentioned at the meeting as the most likely to be supported.
For now the target will be CDM v5.3.1. CDM v6.0 is available, but ATLAS and other tools do not support it yet, so adoption of that version has been slow. With that in mind, CDM v5.3.1 seems the best option for the DQ dashboard for now, with a plan to support CDM v6.0 in the future.
There will be a data quality check design meeting on Monday, July 8th at 12:00pm eastern. The goal for this meeting is to finalize the list of checks we would like performed so please come with questions on the existing checks (listed in the above google doc) or with any additional checks that should be added.
Clair
Note - I sent the meeting invitation out over email. If you did not get one and would like to attend please send me a direct message here or email me at mblacke@its.jnj.com
Thanks to everyone who joined the design meeting on Monday. We had a very productive discussion (recording available here) and decided to meet again this Friday, July 12th at 9am eastern. Invites have gone out but I am happy to add anyone who would like to join - see email address above.
@aldirjr yes, thank you for the reminder! We have moved everything to our GitHub repository, where we have the first version of the tool that will be demoed at this year’s symposium: https://github.com/OHDSI/dataqualitydashboard
A couple weeks ago we had our first DQD development meeting since the symposium. We brainstormed our goals for the upcoming year and tasks we need to accomplish to achieve those goals:
Goals and Objectives for Data Quality Dashboard (DQD)
- Use of the DQD to impact regulatory decision-making
  - Specifically, proving that we have done due diligence in investigating the quality of our data
- Domain-related quality assessment
  - Specifically, proving that we have done due diligence in investigating the quality of our data in relation to the clinical question being asked
- Evaluation of data sources prior to analysis or purchase
- Study feasibility assessment in network research
- Transparency of decisions around thresholds and choices made
- Temporal assessment of DQD results (change over time)
  - Within a source
  - Within a network
Tasks
- Persistence of the DQD results such that they can be built into a study (see the sketch after this list)
  - Minor change to the requirements of the runDQD call
  - Add to the skeleton study
- Testing of a cohort-level run of the DQD
- Addition of more rules
  - Vojtech to volunteer for this
- Goal #1 is dependent on the cohort DQD task
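For the persistence task, a minimal sketch of what this could look like is below, assuming a runDQD function along the lines discussed. The function signature, its arguments, and the JSON file name are all assumptions made for illustration; the real interface is still to be agreed.

```r
# Sketch only: runDQD's signature and the output structure are assumptions,
# not the agreed-upon interface.
library(jsonlite)

dqdResults <- runDQD(
  connectionDetails = connectionDetails,  # e.g. created with DatabaseConnector
  cdmDatabaseSchema = "cdm",
  cohortId          = NULL                # hypothetical argument for a cohort-level run
)

# Persist the results so a skeleton study package can bundle them with its other artifacts
write_json(dqdResults, "dqd_results.json", pretty = TRUE, auto_unbox = TRUE)
```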
We will be meeting every two weeks on Fridays at 3pm eastern. Please contact me if you would like to join the discussion.