@plpo, friends:
This subject comes up all the time and is the subject of many debates. We need to nail this down for good. I have a very strong opinion: keep the observation periods as high quality as possible, even if we lose data. Three reasons:
1) Data presence should be reliable, and data absence should be reliable
Observational data are analyzed with the assumption that two (unspoken) axioms are true:
A. If some relevant medical event happens, there is a record.
B. If there is no record, nothing happened. We have to rely on this because we don’t have negative data in there (like “patient didn’t have an MI”).
Only if these two axioms hold do our typical calculations of incidence and prevalence rates work. And we use them all the time: 80% of all studies calculate rates and ratios of rates (relative risks). Axiom A data are in the numerator; Axiom B data (patients without data) are in the denominator.
Axiom A holds true if (i) the patient is observed and (ii) the event is worthy of recording: MIs are worthy, an itch on the back is not, and everything else is in between. Axiom B holds true if the patient is observed. Either misclassification will distort the rates.
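To make the distortion concrete, here is a minimal sketch in Python. All numbers are made up for illustration, and the person-time denominator is my simplifying assumption, not a prescribed OHDSI method. It shows how a violation of either axiom pulls the rate away from the truth.

```python
def incidence_rate(events: int, person_years: float) -> float:
    """Events per person-year of observation."""
    if person_years <= 0:
        raise ValueError("person_years must be positive")
    return events / person_years


# Hypothetical cohort: 50 MIs over 10,000 truly observed person-years.
true_events = 50
observed_person_years = 10_000.0
true_rate = incidence_rate(true_events, observed_person_years)

# Axiom B violated: 2,000 person-years are counted in the denominator
# although nobody was actually watching the patients during that time.
unobserved_person_years = 2_000.0
diluted_rate = incidence_rate(
    true_events, observed_person_years + unobserved_person_years
)

# Axiom A violated: 10 of the events happened but were never recorded.
missed_events = 10
undercounted_rate = incidence_rate(
    true_events - missed_events, observed_person_years
)

print(f"true rate:         {true_rate:.4f} per person-year")
print(f"Axiom B violated:  {diluted_rate:.4f} per person-year")
print(f"Axiom A violated:  {undercounted_rate:.4f} per person-year")
```

In this toy setup both violations push the rate down. If they hit the two arms of a comparison unevenly, the relative risk is off as well.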
2) Medical History information is low quality
In most cases, that information is obtained from the patient, not from some record. And everybody who ever went to the doctor (i.e. everybody) knows the fidelity of the answers to the doctor's questions. As a result, physicians attach very little significance to these listings of prior problems; at best, they use them so as not to miss a potential differential diagnosis. Family history is even worse (“Do you have any diseases in your family?” - “Oh yes, my husband has trouble falling asleep lately”). That kind of thing.
3) We need not follow the hoarding reflex any longer
Folks feel they have to preserve every little snippet of data for every patient. However, we now have more than one billion patients in the OHDSI network. We can afford to lose some data if we can improve the quality in return. Observational research has a reproducibility and quality problem, not a sample size problem.
Therefore, I would recommend the following:
- We create observation periods where we can most safely assume that Axiom A, and more importantly Axiom B, hold true for all three main data tables: CONDITION_OCCURRENCE, DRUG_EXPOSURE and PROCEDURE_OCCURRENCE.
- The rest goes either in the trash, or into some place where we keep medical history information. The separation will allow folks to use these data if they really want to (e.g. for the infamous exclusion criterion use case), but the data will not participate in the general cohort definitions by default.
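A rough sketch of that separation, in Python, might look like the following. The class names, the `split_events` helper and the sample records are all hypothetical and heavily simplified; a real ETL would work against the CDM tables themselves.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ObservationPeriod:
    person_id: int
    start: date
    end: date


@dataclass(frozen=True)
class Event:
    person_id: int
    event_date: date
    description: str


def split_events(events, periods):
    """Return (observed, history): events inside vs. outside any
    observation period of the same person."""
    periods_by_person = {}
    for period in periods:
        periods_by_person.setdefault(period.person_id, []).append(period)

    observed, history = [], []
    for event in events:
        inside = any(
            period.start <= event.event_date <= period.end
            for period in periods_by_person.get(event.person_id, [])
        )
        (observed if inside else history).append(event)
    return observed, history


# Hypothetical sample data, for illustration only.
periods = [ObservationPeriod(1, date(2015, 1, 1), date(2017, 12, 31))]
events = [
    Event(1, date(2016, 5, 3), "MI recorded during enrollment"),
    Event(1, date(2009, 8, 1), "patient-reported prior MI"),
]

observed, history = split_events(events, periods)
print([e.description for e in observed])
print([e.description for e in history])
```

Only the `observed` list would feed the default cohort definitions. The `history` list stays available for whoever explicitly asks for it.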
Using additional flags or PAYER_PLAN_PERIOD to indicate whether the Drug camera or the Condition/Procedure camera is on makes no sense to me. For each query, we would have to join this table, only to exclude the exact same data I want to relegate to history anyway. The vast majority of all cohort definitions today would break, their definitions would become a lot more complicated, and performance would go down the drain.