
How is missing data handled?


I am preparing a study using the OHDSI ATLAS tool and am wondering how missing data is handled within the study package that is created. For example, in this study one of the cohort’s entry criteria is drug exposure during a certain time period. Then, one of the covariates is hypertension. If this baseline data is missing for some individuals who are included within the cohort, what method is used to handle this missing data in the results produced? I don’t see an option to change the method or discussion within the Book of OHDSI on what methodology is used.

In my previous paper (SC You et al., JAMA, 2020, https://doi.org/10.1001/jama.2020.16167) I wrote:

All variables except laboratory values were binary (yes/ no) and all missing binary variables were considered as not present and coded as no. The missingness in laboratory values were matched and missing values were not imputed.
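As a rough illustration of the coding rule quoted above (this is not the actual study package code), a binary covariate under that convention could be derived as follows. The function name, the record layout and the concept IDs are all made up for the example.

```python
def binary_covariate(cohort_person_ids, condition_records, covariate_concept_ids):
    """Return {person_id: 0 or 1} for every person in the cohort.

    A person with no matching record is coded 0 ("no"), so a truly
    absent condition and an unrecorded one end up with the same value.
    """
    persons_with_record = {
        person_id
        for person_id, concept_id in condition_records
        if concept_id in covariate_concept_ids
    }
    return {pid: int(pid in persons_with_record) for pid in cohort_person_ids}


# Hypothetical data: (person_id, condition_concept_id) pairs.
HYPERTENSION_CONCEPTS = {9001}  # placeholder ID, not a real OMOP concept
cohort = [1, 2, 3]
records = [(1, 9001), (3, 9002)]

covariates = binary_covariate(cohort, records, HYPERTENSION_CONCEPTS)
print(covariates)  # {1: 1, 2: 0, 3: 0}
```

Person 2 has no records at all and person 3 only has an unrelated one; both are coded as 0 for hypertension.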

Thank you! Was that a choice you were able to make in the study package or is that how ATLAS automatically codes it?

I am currently mapping a database into OHDSI, so I can answer this question from that perspective. OHDSI just does not offer the ability to actually store a No as a distinct state from Missing. That information is simply lost during the migration, even if it exists in the original data.
This honestly felt like a pretty huge issue to me, but apparently everyone is fine with it and successfully using the status quo in their research, so I have decided to just accept it^^ You, as a user of the data, just need to always keep in mind that instead of the classic three states (for example, ‘Took drug’, ‘Did not take drug’ and ‘We do not know’) you will only have a ‘Definitely took drug’ state and a ‘Maybe took drug’ state.
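To make the loss of information concrete, here is a minimal sketch of what happens to a three-state source field during such a mapping. It assumes the source stores the answer as True / False / None; the function and patient names are hypothetical and not part of any OHDSI tool.

```python
from typing import Optional


def record_is_written(took_drug: Optional[bool]) -> bool:
    """Decide whether an exposure record is written for a source value.

    Only an explicit True produces a record. An explicit False ("did not
    take drug") and None ("we do not know") both produce no record.
    """
    return took_drug is True


# Hypothetical source data with three distinct states.
source = {"patient_a": True, "patient_b": False, "patient_c": None}

mapped = {pid: record_is_written(value) for pid, value in source.items()}
print(mapped)  # {'patient_a': True, 'patient_b': False, 'patient_c': False}
```

After the mapping, patient_b and patient_c look identical, which is exactly the ‘Maybe took drug’ state.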

@christoph.blapp et al.:

There is no such thing as a missing value in the OMOP CDM. Why? Because observational data just project what is happening in the normal course of healthcare. So, the two assumptions of observational data are:

  • If something happened there is a record,
  • If there is no record nothing happened.

You can only have a missing value if you expect one. And you only do that in a controlled clinical trial where you predefine and protocolize exactly the course of events. If you want to process clinical trial data you will have to add those predefined items somehow. There is a clinical trial working group who might have solved that problem.


Interesting point. That would apply to events, but not to permanent traits: for example, race and ethnicity may be missing in an EHR record.

Thanks for pointing me towards that working group. But I disagree on the broad statements about observational data. I am mapping a specialised registry that collects data generated through normal clinical practice, so ‘observational’ feels like a much more fitting term than ‘trial’, but our experiences do not fit what you describe.
The easiest example of missing data generated by us is in the area of medications. We only ask doctors to share information with us regarding specific treatments relevant to our area of specialisation. For anything within that selection, we are confident that the lack of a record means just what you describe. For any drug outside of it, though, we have no idea. Our patients have no entries for, for example, depression drugs in OMOP, but very much not because those patients never took such drugs. We just do not ask doctors to share that information with us, since mental health is not within our area of stewardship.

Now it is a little clearer what you are up to. Yes, registry data are limited, and by definition do not cover events comprehensively. The typical observational data sources, EHR and claims, are supposed to. They are not perfect either.

But still: except in the MEASUREMENT and OBSERVATION tables, we cannot have records indicating the absence of a fact. And that is what the methods expect, and how we calculate rates.

Your use case: Well, you either know there should be a record, or you don’t know that. In the former case, you should just write it. In the latter, there is not much you can do. The missing value doesn’t help you at all.

Thanks for the clarification. The specialisation towards those two flavours of observational data should probably be more actively communicated, I have by now heard people be like ‘Let’s just all map to OMOP, it can handle everything!’ from multiple directions…
OHDSI seems to be too good at marketing itself^^

@tlasky: Race and ethnicity are tricky in many ways. But you are right. We have to clean up permanent attributes, like germline variants.

Some data fields in healthcare are missing. This is common in demographic data where a patient does not report ethnicity or marital status, for example. Missing values are important when a value exists in reality but is missing in the source data. When this occurs, we typically require non-missing demographics.
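A minimal sketch of what ‘requiring non-missing demographics’ could look like, assuming unknown values are stored as concept ID 0 (the usual OMOP convention for an unmapped or unknown value). The helper name is hypothetical and the nonzero concept IDs are placeholders.

```python
UNKNOWN_CONCEPT_ID = 0  # assumption: unknown/unmapped values are stored as 0


def has_complete_demographics(person, required_fields):
    """True if every required field holds a known (non-null, nonzero) concept."""
    return all(
        person.get(field) not in (None, UNKNOWN_CONCEPT_ID)
        for field in required_fields
    )


# Hypothetical PERSON rows; nonzero IDs are placeholders.
persons = [
    {"person_id": 1, "gender_concept_id": 101, "race_concept_id": 201},
    {"person_id": 2, "gender_concept_id": 102, "race_concept_id": 0},
    {"person_id": 3, "gender_concept_id": None, "race_concept_id": 201},
]
REQUIRED = ["gender_concept_id", "race_concept_id"]

kept = [p["person_id"] for p in persons if has_complete_demographics(p, REQUIRED)]
print(kept)  # [1]
```

Note that this restricts the population rather than imputing anything, so it changes who is in the analysis.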

Out of my own curiosity, could you explain how you’d account for the missing information that you asked the doctors not to share? In an alternative data model, how did you capture this information? And finally, how was it leveraged in your analysis?

The main benefit of not asking for everything is that doctors are much more willing to actually spend the effort on making information accessible if you restrict yourselves to a curated list they see a direct public use in. Our registry relies on doctors voluntarily spending their time. There are no enforcement mechanisms and the only reward is the ability to use that data in their own research, which only a few ever actually do.
We only ‘account’ for it by making sure no one is trying to use our data for research questions that would require other drugs and having a structure where it is fairly obvious that not everything is covered. It is not literally a separate column for each drug we care about, but that mental image is not far off.
The analyses we then do using the data on that curated list benefit from more and higher-quality data, because doctor motivation is not sapped by stuff they consider useless.

I totally understand that my perspective is very different from that of someone using automation to just grab piles of data generated as a side-effect of other operations.

Thanks for the clarification. Just so I understand: you’re not accounting for the missing data in any element of the model, but rather there’s an understanding of ‘fit for use’ for certain questions on your dataset.

In OHDSI, there are several tools that allow you to characterize populations (cohorts or database-wide) that give visibility into what is captured in the data. Using these functions, you can see if certain diagnosis codes are captured, or drug exposures appear in the data, etc. In our own internal studies, we work like you describe: there’s an understanding of the data that informs us if we want to use the dataset to answer a specific question. For example, we won’t use Medicare (where 99% of people are >= 65) to study pediatrics. There is nothing in the data model to indicate ‘no pediatrics’, but we can use the datasource documentation + database characterization to determine usability.
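The kind of check described above can be sketched in a few lines. This is not how the OHDSI characterization tools are implemented; it is only an illustration of the idea, with a hypothetical function name, a simplified (person_id, drug_concept_id) record layout and made-up concept IDs.

```python
def concept_coverage(drug_exposures, concepts_of_interest):
    """Count distinct persons with at least one record per concept."""
    persons_per_concept = {concept: set() for concept in concepts_of_interest}
    for person_id, concept_id in drug_exposures:
        if concept_id in persons_per_concept:
            persons_per_concept[concept_id].add(person_id)
    return {concept: len(ids) for concept, ids in persons_per_concept.items()}


# Hypothetical exposure records and concepts needed for a study.
exposures = [(1, 501), (2, 501), (2, 501), (3, 502)]
needed = [501, 502, 503]

coverage = concept_coverage(exposures, needed)
print(coverage)  # {501: 2, 502: 1, 503: 0}

# Concepts with no records at all deserve a look at the source documentation.
uncaptured = [concept for concept, n in coverage.items() if n == 0]
print(uncaptured)  # [503]
```

A count of zero only says that nothing was recorded. Whether the drug was never used or never collected still has to come from the data source documentation.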

Thank you again for your perspective, but I’ll continue to wait for anyone in the community to offer an example of capturing ‘missing data’ and how it would be leveraged in an analysis.

Mmh, how do you deal with rare drugs? In our cohort we have medications that we actively track but that only a single-digit number of our patients ever resorted to. If we were a bit smaller it would be perfectly possible for us to have zero entries for these drugs. They would look identical to medications we do not track despite us having information about them.
This is something that we do not do for drugs (but actually do for what OMOP considers conditions): how would you deal with medications that are only tracked if certain other factors are true? Like, imagine if we only asked for information regarding a certain drug for pregnant patients or patients with specific sub-diagnoses. A structure like that would likely not be detected just by cohort characterisations.

Is that really missing, @christoph.blapp? “Missing values” assumes they are there, but we haven’t captured them. If the drugs are not used there is nothing missing, except maybe a missed chance for some patients who might benefit.

There are a few of those popping up from time to time, @Chris_Knoll.

  • Clinical trial data, with its strict control over what activity (visit, exposure, condition) happens during the course of the study. So, in a clinical trial, if the patient doesn’t show up for the 3rd visit there will be a record indicating that. In OMOP, there just won’t be a record.
  • Results of diagnostic procedures, for example this one, but similar ones pop up all the time. Right now, we can only capture the result if it is positive. If the result is negative we have to omit the result, as if it never happened.

Both are negative assertions. Our closed-world model cannot distinguish between things that did not happen and things that happened but were not captured. We would drown if we tried doing that. And I can’t think of an example of a positive assertion where missing data are a problem (other than a quality problem).

Use cases: I am waiting as well. Unless you explicitly want to do clinical trial analytics (OMOP is probably the wrong place) or study the effectiveness of screening methods, we can live very well without it.