
How is missing data handled?


I am preparing a study using the OHDSI ATLAS tool and am wondering how missing data is handled within the study package that is created. For example, in this study one of the cohort’s entry criteria is drug exposure during a certain time period. Then, one of the covariates is hypertension. If this baseline data is missing for some individuals who are included within the cohort, what method is used to handle this missing data in the results produced? I don’t see an option to change the method or discussion within the Book of OHDSI on what methodology is used.

In my previous paper (SCYou et al., JAMA, 2020, https://doi.org/10.1001/jama.2020.16167) I wrote:

All variables except laboratory values were binary (yes/ no) and all missing binary variables were considered as not present and coded as no. The missingness in laboratory values were matched and missing values were not imputed.

Thank you! Was that a choice you were able to make in the study package or is that how ATLAS automatically codes it?

I am currently mapping a database to OHDSI, so I can answer this question from that perspective. OHDSI simply does not offer a way to store a No as a state distinct from Missing. That information is lost during the migration, even if it exists in the original data.
This honestly felt like a pretty huge issue to me, but apparently everyone is fine with it and successfully using the status quo in their research, so I have decided to just accept it^^ As a user of the data, you just need to always keep in mind that instead of the classic states, for example ‘Took drug’, ‘Did not take drug’ and ‘We do not know’, you will only have a ‘Definitely took drug’ state and a ‘Maybe took drug’ state.

@christoph.blapp et al.:

There is no such thing as a missing value in the OMOP CDM. Why? Because observational data just reflect what is happening in the normal course of healthcare. So, the two assumptions of observational data are:

  • If something happened there is a record,
  • If there is no record nothing happened.
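
The two assumptions above amount to coding a binary covariate as 1 when at least one matching record exists and 0 otherwise, with no separate "missing" state. Here is a minimal Python sketch of that idea. The function name and the example person IDs are made up for illustration, and this is not the code that ATLAS or the OHDSI packages actually run:

```python
from typing import Dict, Iterable


def binary_covariate(
    cohort_person_ids: Iterable[int],
    person_ids_with_record: Iterable[int],
) -> Dict[int, int]:
    """Code a covariate as 1 if the person has any matching record, else 0.

    Hypothetical helper: absence of a record is treated as "no",
    so there is no third "missing" value in the result.
    """
    with_record = set(person_ids_with_record)
    return {pid: int(pid in with_record) for pid in cohort_person_ids}


# Illustrative data only: person 103 has two hypertension records,
# person 102 has none and is therefore coded 0, not "missing".
cohort = [101, 102, 103]
hypertension_record_person_ids = [101, 103, 103]

covariate = binary_covariate(cohort, hypertension_record_person_ids)
print(covariate)
```

Whether coding person 102 as 0 is appropriate depends on whether the source would have captured a hypertension diagnosis had one existed.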

You can only have a missing value if you expect one. And you only do that in a controlled clinical trial, where you predefine and protocolize the exact course of events. If you want to process clinical trial data, you will have to add those predefined items somehow. There is a clinical trial working group that might have solved that problem.


Interesting point. That would apply to events, but not to permanent traits: for example, race and ethnicity may be missing in an EHR record.

Thanks for pointing me towards that working group. But I disagree on the broad statements about observational data. I am mapping a specialised registry that collects data generated through normal clinical practice, so ‘observational’ feels like a much more fitting term than ‘trial’, but our experiences do not fit what you describe.
The easiest example of missing data generated by us is in the area of medications. We only ask doctors to share information with us about specific treatments relevant to our area of specialisation. For anything within that selection, we are confident that a lack of a record means just what you describe. For any drug outside of that selection, though, we have no idea. Our patients have no entries in OMOP for, say, depression drugs, but that is very much not because those patients never took such drugs. We simply do not ask doctors to share that information with us, since mental health is not within our area of stewardship.
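
To make the distinction concrete, here is a minimal Python sketch of how an analyst could keep track of it outside the CDM. Everything in it is an assumption for illustration: `captured_classes` is a hypothetical, hand-maintained list of the drug classes the registry actually collects, and neither it nor the three-state result is something OMOP stores or OHDSI tools provide.

```python
from enum import Enum
from typing import Set


class ExposureStatus(Enum):
    RECORDED = "record present"
    ABSENT_IN_SCOPE = "no record, but the source collects this drug class"
    UNKNOWN_OUT_OF_SCOPE = "no record, and the source does not collect this drug class"


def exposure_status(
    person_drug_classes: Set[str],
    drug_class: str,
    captured_classes: Set[str],
) -> ExposureStatus:
    """Interpret a missing record in light of what the source collects.

    Hypothetical helper: `captured_classes` is side information kept by
    the analyst, not part of the OMOP CDM.
    """
    if drug_class in person_drug_classes:
        return ExposureStatus.RECORDED
    if drug_class in captured_classes:
        return ExposureStatus.ABSENT_IN_SCOPE
    return ExposureStatus.UNKNOWN_OUT_OF_SCOPE


# Illustrative values only.
captured = {"immunosuppressant", "corticosteroid"}
patient = {"corticosteroid"}

for drug_class in ("corticosteroid", "immunosuppressant", "antidepressant"):
    print(drug_class, "->", exposure_status(patient, drug_class, captured).name)
```

Only the first two outcomes match the "no record means nothing happened" reading. The third is the case described above, where the absence of a record says nothing about the patient.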

Now it is a little clearer what you are up to. Yes, registry data are limited and, by definition, do not cover events comprehensively. The typical observational data sources, EHR and claims, are supposed to. They are not perfect either.

But still: except in the MEASUREMENT and OBSERVATION tables, we cannot have records indicating the absence of a fact. And that is what the methods expect, and how we calculate rates.

Your use case: well, you either know there should be a record, or you don’t. In the former case, you should just write it. In the latter, there is not much you can do. The missing value doesn’t help you at all.

Thanks for the clarification. The specialisation towards those two flavours of observational data should probably be communicated more actively. By now I have heard people say ‘Let’s just all map to OMOP, it can handle everything!’ from multiple directions…
OHDSI seems to be too good at marketing itself^^