(Very long) introduction
The problem with evidence in the literature
In my 2016 OHDSI Symposium talk and in our paper, which is now on arXiv, we argue (and provide evidence) that the current process of observational science is problematic. That process centers on generating one estimate at a time, using a unique study design of unknown reliability, and publishing (or not publishing) one estimate at a time:
- Because each study uses a unique design, and (virtually) never includes an empirical evaluation of that design, the reliability of the results is uncertain. For example, a study design may be vulnerable to unmeasured confounding and therefore produce spurious results, but this fact is not reflected in the p-value or confidence interval (CI). We have found that most likely over half of published findings with p < 0.05 are not statistically significant when realistic operating characteristics are assumed. Recently, we showed that at least some studies appear to have near-nominal operating characteristics, for example containing the true effect in the 95% CI 95% of the time, but most studies do not, and one must include negative and positive controls in each study to find this out.
- Because of pervasive publication bias favoring results that are considered "statistically significant", many results on small or null effects never get published and are therefore unavailable to decision makers. Furthermore, because of the hidden multiple testing that comes with publication bias, it has been concluded that most published research findings are wrong. The sketch below illustrates this mechanism.
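To make the hidden multiple testing concrete, here is a minimal simulation (not from our paper; the prevalence of true effects, the power, and the alpha level are all hypothetical assumptions) of what happens when only p < 0.05 results get published: even with perfectly well-behaved p-values, the published record ends up dominated by false positives when true effects are rare and power is modest.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n_studies = 100_000
is_true_effect = rng.random(n_studies) < 0.10   # assume 10% of tested hypotheses are real effects
power, alpha = 0.50, 0.05                       # assume 50% power and a 5% type I error rate

# A study reaches p < 0.05 with probability `power` if the effect is real,
# and with probability `alpha` if the null is true.
significant = np.where(is_true_effect,
                       rng.random(n_studies) < power,
                       rng.random(n_studies) < alpha)

# Publication bias: only the significant results appear in the literature.
published = significant
false_positives = np.sum(published & ~is_true_effect)
print(f"Published findings: {published.sum()}")
print(f"False positives among them: {false_positives} "
      f"({100 * false_positives / published.sum():.0f}%)")
# Under these assumptions roughly half of the published findings are false,
# and worse prior odds or lower power push that fraction even higher.
```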
In conclusion, the reliability of currently published evidence is in serious doubt.
The solution
In both my 2016 OHDSI Symposium talk and our paper on arXiv we propose the following solution:
- Instead of answering one question at a time, we will answer many at once. As a proof of concept, we focused on comparing 17 depression treatments for 22 outcomes of interest in 4 observational databases, resulting in 17,718 hazard ratios. Each estimate is produced using state-of-the-art observational research methodology, including propensity score adjustment and thoughtfully designed exposure and outcome definitions.
- The study design is thoroughly evaluated by including negative and positive control outcomes in each comparison, and the results of this evaluation are incorporated in the estimates through p-value and CI calibration (sketched below).
By generating all evidence at once, and disseminating this evidence as a whole, we can prevent publication bias. As a consequence, there will be evidence on small and null effects, and there is no hidden multiple testing. By including controls to perform empirical evaluation and calibration, we gain confidence in the operating characteristics of our study, for example that the truth is within the 95% CI 95% of the time. For these reasons, we believe our solution will provide an evidence base of much greater reliability than the current scientific literature.
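As an illustration of what the calibration does, below is a deliberately simplified sketch. It is not the LEGEND implementation (the actual method lives in the OHDSI EmpiricalCalibration R package and also models the standard error of every estimate); the negative-control estimates and the hazard ratio of interest are made-up numbers, and the empirical null is reduced to a plain Gaussian fitted to the negative-control log hazard ratios.

```python
import numpy as np
from scipy import stats

# Hypothetical log hazard ratios estimated for negative control outcomes
# (outcomes where the true hazard ratio is 1, so the true log HR is 0).
negative_control_log_hr = np.array([0.10, -0.05, 0.22, 0.15, -0.02,
                                    0.30, 0.08, 0.18, -0.10, 0.25])

# Fit the empirical null: residual systematic error shows up as a shifted
# and/or widened Gaussian instead of the theoretical null centered at 0.
null_mean = negative_control_log_hr.mean()
null_sd = negative_control_log_hr.std(ddof=1)

def calibrated_p(log_hr):
    """Two-sided p-value computed against the empirical null."""
    z = (log_hr - null_mean) / null_sd
    return 2 * stats.norm.sf(abs(z))

def traditional_p(log_hr, se):
    """Two-sided p-value against the theoretical null (mean 0), for comparison."""
    return 2 * stats.norm.sf(abs(log_hr / se))

log_hr = np.log(1.5)   # a hypothetical estimate of interest: HR = 1.5
print(f"Empirical null: mean {null_mean:.2f}, sd {null_sd:.2f}")
print(f"Traditional p-value (se = 0.15): {traditional_p(log_hr, 0.15):.4f}")
print(f"Calibrated p-value:              {calibrated_p(log_hr):.4f}")
# The calibrated p-value is larger, because the negative controls reveal a
# positive bias that the traditional p-value silently ignores.
```

The full method additionally accounts for the sampling error of each individual estimate, and calibrating CIs requires positive controls as well; those details are omitted here for brevity.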
The problem with the solution
In 2018, we aim to practice what we preach: we have initiated LEGEND (Large-scale Evidence Generation and Evaluation in a Network of Databases), a collaboration tasked with generating evidence at large scale and making this evidence publicly available. LEGEND will produce tens of thousands of effect estimates, each using OHDSI's established best practices. As concluded above, we believe such an evidence base will be far more reliable than the current published literature. If you have a specific clinical question, you can look up our results in the evidence base and have high(er) confidence in the reliability of this evidence.
But we suspect that our solution comes with a potentially serious side effect: because our results come in a nice machine-readable format, it is perfectly possible to perform data dredging. Instead of using our evidence as one would use the current literature, which is to find the answer to a specific question, one could simply look at things that appear interesting. You could focus on the smallest p-values after calibration and draw drastic conclusions from what you find. Or, since we expect to perform quite a few sensitivity analyses, one could focus on the analysis that produces the desired result and disregard the others. The problem with this cherry picking is that even if we believe a p-value has nominal operating characteristics, 5% of results will still have p < 0.05 when the null is true for all of them. Just because there is no more hidden multiple testing does not mean there is no multiple testing; it is now out in the open and should be corrected for when using the data. For example, if I deliberately select the lowest of 10,000 p-values, I really should adjust for 10,000 tests (see the sketch below).
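As a sketch of that last point (purely simulated numbers, not LEGEND results): the smallest of 10,000 null p-values will look impressive on its own, but after a simple Bonferroni adjustment for the 10,000 implicit tests it is no longer remarkable.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n_tests = 10_000

# Under the null, p-values are uniformly distributed; simulate one "evidence base"
# in which every null hypothesis is actually true.
p_values = rng.random(n_tests)

smallest = p_values.min()
# Bonferroni: multiply the cherry-picked p-value by the number of looks taken.
bonferroni_adjusted = min(smallest * n_tests, 1.0)

print(f"Smallest of {n_tests} null p-values: {smallest:.6f}")
print(f"After Bonferroni adjustment:         {bonferroni_adjusted:.3f}")
# The raw minimum looks highly "significant" even though every null is true;
# adjusted for the 10,000 implicit tests, it is unremarkable.
```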
To frame it less nefariously: one could use our evidence for hypothesis testing (looking up the answer to a specific question), but also for hypothesis generation (finding questions where the answer is "interesting"). I strongly believe we could provide great value for patients and other stakeholders by supporting hypothesis testing through LEGEND. I suspect we could also provide value via hypothesis generation, but I am not yet sure how best to do that, and there is a real danger that people will do hypothesis generation badly and do more harm than good.
What do people think we should do? Should we throw out the baby with the bathwater, and not generate evidence at scale in order to prevent misuse? Should we enforce proper use of our evidence base, and if so, how? Or should we not worry about this, because the cure is better than the disease?