
OMOP CDM - patient data quality flag

Dear all,

This is a THEMIS-related topic; I hope I am posting it in the right location. The topic has been beaten up somewhat at this point, but I wanted to take it to the forums because it keeps coming up in our interactions with multiple OMOP CDM users.

When doing OMOP CDM conversions, we filter out data with insufficient quality. Specifically, in CPRD that means patients marked with accept = 0. However, filtering data out makes a lot of CPRD users uncomfortable: they feel they would be “losing” a large number of records that they are used to seeing in their cohorts, even knowing those are not good-quality records.

The idea, of course, is that only trustworthy data should be used to ensure quality observational research. However, I understand that there are a few cases - like with open claims (unadjudicated) - where data with insufficient quality is also useful.

I wanted to have a quick discussion here again and brainstorm possible ideas:

  1. Are there use cases where patients with insufficient quality (in CPRD, those with the accept = 0 flag) can be used?

  2. Let’s say the answer to #1 is yes. Would it then make sense to have a flag in the OMOP CDM marking patients as “acceptable quality”? ATLAS could then use this flag in the cohort builder, where by default it would only use records of “acceptable quality”. For those who decide to filter out records as part of the ETL, that means no change at all, as all records would be marked as “good”. This idea of having a flag does break down, somewhat, in the example of open claims.

Another idea that was proposed is to have another instance with CPRD data with “insufficient quality” patient records.

tagging some OHDSI heavyweights - @Christian_Reich, @Patrick_Ryan, @ericaVoss, @Rijnbeek



I think this issue has recommendations out for OHDSI community review:
Person Inclusion

We have been trying to figure out how to handle discomfort with patient loss. For example, in our claims data sets we remove people without gender, and people get uncomfortable when they see that the raw data has more patients than the CDM. However, when you review their studies, the first thing they often do is select only patients who are male and/or female. Most of the data changes we implement in our CDM builds are things people would typically do anyway when leveraging the data; we just want to do it in a standard way in the CDM build instead of leaving it up to the analyst to remember, or to know how, to do it.

One way we want to tackle this is to use Metadata to describe what data is lost and why (@Ajit_Londhe has started a WG for this). In other words, our CDM Builder has a handful of reasons it drops data; we want to quantify how much data has been lost and for what reasons (e.g., 1000 patients were dropped because their gender was unknown). Then, when working with an analyst who is concerned about data loss, it is easier to explain why and what, and it makes the whole process less “black box”-ish. If we believe certain data shouldn’t be used for research, then we can do the analysts a favor by handling it before they encounter it.
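The drop-reason accounting described above can be sketched as a simple tally kept during the ETL. This is only an illustration under assumptions: the row layout, field names, and reason strings are made up here and are not the actual CDM Builder or Metadata WG schema.

```python
from collections import Counter

# Toy source rows; gender of None stands in for an unknown gender code.
source_persons = [
    {"person_id": 1, "gender": "M"},
    {"person_id": 2, "gender": None},
    {"person_id": 3, "gender": "F"},
    {"person_id": 4, "gender": None},
]

drop_reasons = Counter()  # tally of why rows were dropped, not just how many
kept = []
for row in source_persons:
    if row["gender"] is None:
        drop_reasons["gender unknown"] += 1
        continue
    kept.append(row)

# The tally could then be written to a metadata table alongside the CDM,
# so analysts can see "1000 patients dropped: gender unknown" at a glance.
print(dict(drop_reasons))  # {'gender unknown': 2}
print(len(kept))           # 2
```

The point of the counter is exactly the "less black box" goal: the loss is quantified per reason rather than disappearing silently.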

This is not to say that our decisions on how to handle the data apply to all groups doing research (which gets back to how THEMIS worded the recommendation). In our organization the bulk of the research is of a certain type, so we tailor our CDMs to that research. But making the decisions clear, through your ETL document and in the near future through Metadata, is a requirement.

To answer your questions:

  1. We currently do not use CPRD patients where ACCEPT = 0. I would be interested in understanding others’ use cases.
  2. If you want to build a CDM with both patient types, I would just add the flag to the OBSERVATION table so you can still tell who is who. We do something similar for the Mental Health and Substance Abuse flags in Truven CCAE. Then you could use ATLAS.
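The OBSERVATION-table idea above could look something like the following sketch. To be clear about assumptions: the tables are reduced to a couple of columns (the real CDM DDL has many more), and the concept ID used for the ACCEPT flag (2000000001) is a placeholder for whatever local concept a site would register, not a real standard concept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Minimal stand-ins for the OMOP CDM PERSON and OBSERVATION tables.
cur.execute("CREATE TABLE person (person_id INTEGER PRIMARY KEY)")
cur.execute("""CREATE TABLE observation (
    person_id INTEGER,
    observation_concept_id INTEGER,
    value_as_number REAL)""")

# Hypothetical local concept ID carrying CPRD's ACCEPT flag.
ACCEPT_FLAG_CONCEPT_ID = 2000000001

cur.executemany("INSERT INTO person VALUES (?)", [(1,), (2,), (3,)])
# Persons 1 and 2 are acceptable quality (ACCEPT = 1); person 3 is not.
cur.executemany(
    "INSERT INTO observation VALUES (?, ?, ?)",
    [(1, ACCEPT_FLAG_CONCEPT_ID, 1),
     (2, ACCEPT_FLAG_CONCEPT_ID, 1),
     (3, ACCEPT_FLAG_CONCEPT_ID, 0)],
)

# A cohort tool could default to persons whose quality flag is 1.
accepted = [row[0] for row in cur.execute("""
    SELECT p.person_id
    FROM person p
    JOIN observation o
      ON o.person_id = p.person_id
     AND o.observation_concept_id = ?
    WHERE o.value_as_number = 1
    ORDER BY p.person_id
""", (ACCEPT_FLAG_CONCEPT_ID,))]
print(accepted)  # [1, 2]
```

Because the flag lives in a standard table, both kinds of users are served: those who want everyone can ignore it, and those who want only accepted patients add one join.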

@jennareps is our resident CPRD expert, tagging her to see if she has any other feelings on the matter.



I am not sure why this keeps coming back like a counterfeit dollar bill. I also don’t understand why we have to optimize the “comfort” and “feelings” of people. Our job is to create accurate and reproducible answers to questions. And we have built a whole ideology around it:

  • Use standard software QA principles when preparing data and building methods and tools
  • Make all data manipulations and all methods public, easy to reproduce and transparent
  • Compare answers generated from disparate data using disparate methods and parameters
  • Quantify the degree of confidence they give us using known facts (negative and positive controls)

Please please please let us use our energy to promote these principles.

I must say I have some empathy for why people cling to this “No patient left behind” mantra. I hear two reasons why these records are supposedly needed:

  1. As a quality test of data manipulations (e.g. ETL)
  2. Lack of sample size is our deadly enemy.

Both are bullshit. Time to put them to bed.

The first one is a remnant from Clinical Trial regulations: Only following (monitoring as they call it) each datum of each patient will guarantee full quality. In our world that is a fallacy. The data were wrong before we touched them. They violate axiom #1 (if something happened to the patient it is recorded), and they violate even more axiom #2 (if nothing is recorded, nothing happened). And yet we still happily calculate incidence rates as if the data were the Word passed down from the Lord to the Prophet. Remember: We don’t want perfect data, we want perfect answers to our questions. And if the community decides that certain attributes indicate a data quality problem or otherwise screw up our methods (like missing age) we can safely remove these records without changing the accuracy of the answer. If they did change the answer we will catch that using the principles above and start debugging.

The second one, unless we are unlucky enough to be studying a truly rare effect, stopped being a problem long ago: we have large databases now, and they are getting bigger and more beautiful every day. We are not so destitute that we have to look for scraps of potatoes in the garbage. Let’s work on the enormous number of questions we can tackle and answer today.

The data quality flag in CPRD (the same exists in THIN) is a very good example of a data quality rule: until a practice passes some predefined heuristic, its data are flagged. I wish they had just jettisoned those records during data production; nobody would even have made a peep.