
The promise of large-scale evidence generation and the danger of data dredging

(Very long) introduction

The problem with evidence in literature

In my 2016 OHDSI Symposium talk and in our paper now on arXiv, we argue (and provide evidence) that there are serious issues with the current process of observational science, which centers on generating one estimate at a time, using a unique study design of unknown reliability, and publishing (or not publishing) one estimate at a time:

  1. Because each study uses a unique design, and (virtually) never includes an empirical evaluation of that design, the reliability of the results is uncertain. For example, a study design may be vulnerable to unmeasured confounding, and therefore produce spurious results, but this fact is not reflected in the p-value or confidence interval (CI). We have found that likely more than half of published findings with p < 0.05 are not statistically significant once realistic operating characteristics are assumed. Recently, we showed that at least some studies appear to have near-nominal operating characteristics, for example having the true effect within the 95% CI 95% of the time, but most studies do not, and one must include negative and positive controls in each study to find this out.

  2. Because of pervasive publication bias favoring publication of results that are considered ‘statistically significant’, many results on small or null effects never get published, and are therefore unavailable to decision makers. Furthermore, because of the hidden multiple testing that comes with publication bias, it has been concluded that most published research findings are wrong.

In conclusion, the reliability of the current published evidence is in serious doubt.

The solution

In both my 2016 OHDSI Symposium talk and our paper now on arXiv, we propose the following solution:

  • Instead of answering one question at a time, we will answer many at once. As a proof-of-concept, we focused on comparing 17 depression treatments for 22 outcomes of interest in 4 observational databases, resulting in 17,718 hazard ratios. Each estimate is produced using state-of-the-art observational research methodology, including propensity score adjustment, and thoughtfully-designed exposure and outcome definitions.

  • The study design is thoroughly evaluated by including negative and positive control outcomes in each comparison, and the results of this evaluation are incorporated in the estimates through p-value and CI calibration.

By generating all evidence at once, and disseminating this evidence as a whole, we can prevent publication bias. As a consequence, there will be evidence on small and null effects, and there is no hidden multiple testing. By including controls to perform empirical evaluation and calibration, we gain confidence in the operating characteristics of our study, for example that the truth is within the 95% CI 95% of the time. For these reasons, we believe our solution will provide an evidence base of much greater reliability than the current scientific literature.
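To give a flavor of what the calibration step involves, here is a minimal sketch (Python, with made-up negative-control estimates; it uses a crude moment-based fit of the systematic error distribution and ignores the uncertainty in that fit, so it illustrates the idea rather than the actual implementation in the OHDSI EmpiricalCalibration R package):

```python
import numpy as np
from scipy import stats

# Hypothetical log hazard ratios and standard errors for negative controls
# (outcomes believed to have no causal relation to the exposure).
nc_log_hr = np.array([0.10, 0.40, -0.15, 0.45, 0.20, 0.30, -0.05, 0.50])
nc_se     = np.array([0.10, 0.12,  0.10, 0.15, 0.10, 0.12,  0.10, 0.15])

# Fit an empirical null distribution N(mu, sigma^2) to the negative-control
# estimates; a crude moment-based fit that subtracts the average sampling variance.
mu = nc_log_hr.mean()
sigma2 = max(nc_log_hr.var(ddof=1) - (nc_se ** 2).mean(), 0.0)

def calibrated_p(log_hr, se):
    """Two-sided p-value against the empirical null instead of N(0, se^2)."""
    z = (log_hr - mu) / np.sqrt(sigma2 + se ** 2)
    return 2 * stats.norm.sf(abs(z))

# A new estimate: HR = 1.5 with SE(log HR) = 0.15 (hypothetical).
log_hr, se = np.log(1.5), 0.15
print("uncalibrated p:", 2 * stats.norm.sf(abs(log_hr / se)))
print("calibrated p:  ", calibrated_p(log_hr, se))
```

When the negative controls show systematic error, the calibrated p-value is noticeably larger than the uncalibrated one, which is exactly the point of the empirical evaluation.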

The problem with the solution

In 2018, we aim to practice what we preach: We have initiated LEGEND (Large-scale Evidence Generation and Evaluation in a Network of Databases), a collaboration tasked with generating evidence at large scale, and making this evidence publicly available. LEGEND will produce tens of thousands of effect estimates, each using OHDSI’s established best practices. As concluded above, we believe such an evidence base will have a much higher reliability than current published literature. If you have a specific clinical question, you can look up our results in the evidence base, and have high(er) confidence in the reliability of this evidence.

But we suspect that our solution comes with a potentially serious side effect: because our results come in a nice machine-readable format, it is perfectly possible to perform data dredging. Instead of using our evidence as one would use the current literature, that is, to find the answer to a specific question, one could simply look for things that appear interesting. You could focus on the smallest p-values after calibration, and draw drastic conclusions based on what you find. Or, since we expect to perform quite a few sensitivity analyses, one could focus on the analysis that produces the result one wants, disregarding the others. The problem with this cherry picking is that even if we believe a p-value has nominal operating characteristics, 5% of results will still have p < 0.05 even if the null is true for all. Just because there is no more hidden multiple testing does not mean there is no multiple testing; it is now out in the open, and should be corrected for when using the data. For example, if I deliberately select the lowest of 10,000 p-values, I really should adjust for 10,000 tests.
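To make the 'adjust for 10,000 tests' point concrete, here is a minimal sketch using simulated null p-values (not LEGEND output): even the smallest of 10,000 p-values is unremarkable once the number of looks is taken into account.

```python
import numpy as np

rng = np.random.default_rng(42)
n_tests = 10_000

# Simulate p-values when the null is true for every estimate:
# under the null, p-values are uniform on [0, 1].
p = rng.uniform(size=n_tests)

p_min = p.min()
print(f"smallest of {n_tests} null p-values: {p_min:.6f}")

# Bonferroni: the cherry-picked minimum is only 'significant' if it survives
# multiplication by the number of tests that were implicitly performed.
print(f"Bonferroni-adjusted minimum: {min(p_min * n_tests, 1.0):.3f}")
```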

To frame it less nefariously: one could use our evidence for hypothesis testing (looking up the answer to a specific question), but also for hypothesis generation (finding questions where the answer is ‘interesting’). I strongly believe we could provide great value for patients and other stakeholders by supporting hypothesis testing through LEGEND. I suspect we could also provide value via hypothesis generation, but I am not yet sure how best to do that, and there is a real danger that people will do hypothesis generation badly, and do more harm than good.

What do people think we should do? Should we throw the baby out with the bathwater, and not generate evidence at scale in order to prevent misuse? Should we enforce proper use of our evidence base, and if so, how? Or should we not worry about this, since the cure is better than the disease?


Obviously we should generate the evidence. If we don’t, someone will, and probably not as well. I think the results, used properly, will be useful.

We could do the dredging for them and prove that it doesn’t work. (But also prove that used properly, it performs well.) And develop criteria for what constitutes dredging. Maybe also apply those same criteria to studies parsed from the literature to show that the literature is no better or even worse in terms of vulnerability to dredging (once you parse it and make it available en masse).

George

Define significance in terms of both effect sizes and p values.

Setting a threshold for effect size at some level above “small” will solve part of the problem you raise by making it less likely that 5% of results will be both moderately large and have a p < .05. It will also focus attention appropriately away from results that are statistically significant but clinically unimportant. Most importantly, it eliminates the range of effects that are least likely to be reproducible even when analyzed appropriately, thus strengthening the claim to reproducibility.

Though near-nominal coverage of p-values at < .05 is better than current evidence publication standards, it addresses only one piece of the reproducibility problem, a piece that remains problematic even when done correctly. Your (and George’s, Patrick’s, David’s and Marc’s) brilliant work on empirical calibration of effect sizes and CIs would seem to allow an approach for generating evidence at scale without the highly problematic conflation of statistical significance and clinical importance.

A reasonable definition of “small” for most results could be tricky to pin down. Case-by-case definitions would be challenging to do at scale, and static thresholds for moderate effects will vary in their appropriateness across cases. But I think there is a strong argument for choosing a somewhat artificial effect-size limit, even if it fails to credit some effects that matter despite being small, rather than crediting lots of small effects that are both less likely to be clinically important and less likely to be reproducible.

I hesitate to respond to this because I am not familiar with the details of the approach so I don’t want to be criticizing it. And it is always easy for people to say “you should have done it differently”. So, with that in mind, after reading just this post, I wanted to pass along this comment for whatever it is worth.

I would argue that this is not what people want. What they want is the probability that the parameter (hazard ratio) is greater than or equal to some clinically important value related to risk, in a particular population.

I would also argue that this approach appears to be predicated on pooling and clustering of estimates.

Both of these point to using Bayesian estimation, which avoids the p-value problem and focuses on estimating risk ratios.

Thanks! Both @Andrew and @Mark_Danese stress the importance of ‘clinical significance’ over ‘statistical significance’. I completely agree with what they said, and apologize if my original post suggested we should focus on p-values and statistical significance.

However, when starting this discussion I hoped to raise a different point. No matter what your output of interest is, be it a p-value, a point estimate plus confidence interval for the hazard ratio, the posterior probability that the effect size > x, the lower bound of the 95% credible interval > y, or some yet-to-be-defined statistic, they all have one thing in common: they express uncertainty. In one way or another, this uncertainty attempts to convey the probability of drawing a wrong conclusion. For example, assume we find that the posterior probability that RR >= 2 is 95%. If that is the only result we produced, we may decide to act on the assumption that RR >= 2, and accept the 5% risk of taking the wrong action.

But things are not that simple when we consider many effect size estimates. If I produce 10,000 estimates, and from that list select the one with the highest posterior probability that RR >= 2, then a highest posterior probability of 95% should be viewed very differently than if it were the result of a single study. We can easily fall prey to the Texas sharpshooter fallacy: we may think that a 95% probability is very high, but we had 10,000 shots at the target, so getting 95% by random chance is really not unlikely at all. Many small uncertainties combine into a very large uncertainty.

The story is easier to tell using p-values as an example: of 10,000 estimates, I would expect 500 to have p < 0.05 (and 50 to have p < 0.005, if you want to go there) even if the null is true for all. If I then present those 500 (or 50) without telling you I looked at the other 9,500 (or 9,950), you would be misled when trying to interpret those p-values.
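For the skeptical reader, the expectation is easy to check with a short simulation (all-null simulated p-values, so any apparent signal is pure chance):

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(size=10_000)          # 10,000 p-values with the null true for all
print((p < 0.05).sum(), "of 10,000 below 0.05")    # ~500 expected
print((p < 0.005).sum(), "of 10,000 below 0.005")  # ~50 expected
```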

There is of course lots of literature on how to deal with multiple testing. But my worry is that people may not be aware that they’re doing multiple testing when simply browsing through our evidence.

You are correct – anytime someone makes a decision with a 5% error rate, there will be 5% of estimates that are wrong (and it won’t be clear which are the wrong ones). I don’t think there is any way of getting around that.

My suggestion on using Bayesian estimation is to provide the 95% credible intervals for the risk parameters of interest. In that way, there is no hypothesis to test, there is no thinking about 95% of intervals constructed in the same way containing the true estimate, and there is nothing to be “wrong” about. The credible intervals summarize the evidence about the risks of interest. The intervals may, or may not, include 1.0 in them, but that is beside the point.

But I am, by far, the least qualified person to discuss Bayesian philosophy. I am just suggesting another approach to presenting the evidence from a very large collection of analyses.

What do people think we should do?
Should we throw the baby out with the bathwater, and not generate evidence at scale in order to prevent misuse?

I think you should try to do LEGEND and publish the output.

Should we enforce proper use of our evidence base, and if so, how?

Yes, by stating all the limitations in a disclaimer accompanying the provided data.

Or should we not worry about this, since the cure is better than the disease?

Try to write up the best disclaimer but don’t agonize over “I could have written a better one”.

@schuemie your post was clear. I apologize if mine muddied the waters. My suggestion was meant to help address one piece of an anti-dredging solution that I assume will have to be multipronged.

As you know, most effects that fail to replicate are small, i.e. the distribution of replication failures over effect sizes is very strongly skewed to the left. Distinguishing small from other effects in the rule for marking results as “significant” in LEGEND’s evidence base should reduce, though not eliminate, the danger that an interestingly large result has a low probability of replication.

This is especially true if “replication” is defined liberally (an effect with the same sign) rather than strictly (an effect within the 95% CI). My guess is that this liberal definition maps better to many of the questions people look to evidence for (“Is a more effective, or more harmful, than b?”) than a stricter definition does (“Is a some amount x more effective than b?”). The p-rep statistic (Killeen, Psychological Science, 2005) lends itself to this use.
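For what it is worth, here is a minimal sketch of the same-sign-replication idea (one common approximation in the spirit of Killeen’s p-rep, assuming a flat prior and an exact replication with the same standard error; the numbers are hypothetical):

```python
import numpy as np
from scipy import stats

def p_same_sign_replication(log_hr, se):
    """Probability that an exact replication yields an effect with the same sign,
    assuming a flat prior on the true effect and the same standard error."""
    return stats.norm.cdf(abs(log_hr) / (se * np.sqrt(2)))

# Hypothetical estimate: HR = 1.3 with SE(log HR) = 0.1
print(p_same_sign_replication(np.log(1.3), 0.1))
```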

Is your concern about this risk more purely related to results being machine-readable? In that case, maybe markers of results’ significance could differentiate how they are read by machines. Maybe have p values for small effects in different fields or represented as strings rather than numeric values. That wouldn’t pose much of a coding challenge for someone intent on scraping, obviously, but maybe it would prompt the coder to think a bit more about why they are represented differently.

Another partial solution for human readers might be to link each result to published related evidence. And - as if that weren’t enough work - the process for retrieving results through that link might include guidance on how to evaluate the implications.

Forwarding this message from Andrew Gelman (Columbia University):

Dear Martijn:

Your post was forwarded to me by Mark Tuttle (see below). I have two thoughts:

  1. I agree that researchers should be answering many parallel questions at once, rather than studying hypotheses one at a time.
  2. But I disagree that sets of studies should be analyzed using “p-value and CI calibration.” There’s a big problem with multiple comparisons adjustments and p-values more generally, which is that a p-value is a noisy summary of data, and a huge amount of information is lost by setting any threshold. If you are picking effects based on whether p-values exceed a threshold or confidence intervals exclude zero, then you’re just adding lots of noise to the situation.

Instead, I recommend multilevel modeling as discussed in this paper from 2012:
http://www.stat.columbia.edu/~gelman/research/published/multiple2f.pdf

“Data dredging” is not such a problem if you report and analyze everything.

I hope this is helpful. You can feel free to share this message with others on your list.

Yours
Andrew
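(For readers unfamiliar with the idea, here is a minimal sketch of partial pooling on simulated log hazard ratios. It is a simple normal-normal empirical-Bayes shrinkage, not the full multilevel model the linked paper develops, and the numbers are made up; the point is that the most extreme of 10,000 raw estimates shrinks sharply once all estimates are modeled together.)

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Simulated 'truth': most true log hazard ratios are null, a few are not.
true_log_hr = np.where(rng.uniform(size=n) < 0.05, rng.normal(0.4, 0.1, n), 0.0)
se = rng.uniform(0.1, 0.3, size=n)
est = rng.normal(true_log_hr, se)            # noisy per-question estimates

# Empirical-Bayes normal-normal model: true effects assumed ~ N(mu, tau^2).
mu = est.mean()
tau2 = max(est.var() - (se ** 2).mean(), 0.0)

# Partial pooling: each estimate is shrunk toward mu, more so when its SE is large.
shrinkage = tau2 / (tau2 + se ** 2)
pooled = mu + shrinkage * (est - mu)

print("largest raw estimate:       ", est.max())
print("same estimate after pooling:", pooled[est.argmax()])
```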


Thanks @Andrew! Yes, you’re absolutely right that reproducibility is likely to be a bigger problem for smaller effect sizes.

I also fully agree that we need to provide background literature on effect sizes where possible. In fact, @ericaVoss and others have been working quietly to populate the Common Evidence Model to allow us to link our results to evidence in the literature, spontaneous reports, and product labels. One goal is to present our results in a holistic interface, following @Patrick_Ryan’s vision of HOMER.


(Here was my separate email back to Mark referring to Andrew G’s email.)

I think that’s what we are doing. We publish the raw evidence as CIs. We don’t decide what to publish based on thresholds. Then you pick a hypothesis that you care about for other reasons, and use only that section of the results.

How you incorporate the evidence is up to you. If you want to use a traditional frequentist approach, fine, correct for the right number of multiple hypotheses. Or you can use a model like Andrew suggests. Martijn is not picking the first over the second.

I think Andrew thinks Martijn is using calibration to address multiple hypotheses. He is not. He is using it to address unmeasured confounding. I don’t think Andrew’s hierarchical Bayesian model is addressing unmeasured confounding (if nothing else because he is not incorporating negative controls or any other way to detect unmeasured confounding). Andrew uses the word “calibration” in his paper to refer to adjusting the p-value for multiple hypotheses; that is not what Martijn is doing.

Sanat Sarkar has pointed out to me that if one wanted to account for multiplicity (e.g., FDR control or simultaneous intervals), there are dependencies within the some-by-some (or all-by-all) comparisons that one would have to account for. Is anyone working on this?

Not to my knowledge. You are of course correct: many estimates we produce would be statistically dependent. For example, estimates involving the same target and outcome (and only varying in the comparator cohort) would re-use much of the same target population. Similarly, and less obviously, after empirical calibration, estimates for the same target-comparator-outcome across databases will be correlated, because the systematic error accounted for in the calibration is likely to be correlated.

Just another perspective on this – Jeff Blume has developed “second generation p-values” which might be interesting in this context. http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0188299

Abstract below:

Verifying that a statistically significant result is scientifically meaningful is not only good scientific practice, it is a natural way to control the Type I error rate. Here we introduce a novel extension of the p-value, a second-generation p-value (pδ), that formally accounts for scientific relevance and leverages this natural Type I Error control. The approach relies on a pre-specified interval null hypothesis that represents the collection of effect sizes that are scientifically uninteresting or are practically null. The second-generation p-value is the proportion of data-supported hypotheses that are also null hypotheses. As such, second-generation p-values indicate when the data are compatible with null hypotheses (pδ = 1), or with alternative hypotheses (pδ = 0), or when the data are inconclusive (0 < pδ < 1). Moreover, second-generation p-values provide a proper scientific adjustment for multiple comparisons and reduce false discovery rates. This is an advance for environments rich in data, where traditional p-value adjustments are needlessly punitive. Second-generation p-values promote transparency, rigor and reproducibility of scientific results by a priori specifying which candidate hypotheses are practically meaningful and by providing a more reliable statistical summary of when the data are compatible with alternative or null hypotheses.
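To make the abstract concrete, here is a minimal sketch of the computation as I read it: the interval estimate is compared to a pre-specified interval null, and the correction factor for very wide (inconclusive) interval estimates follows my reading of Blume et al. and should be checked against the paper.

```python
import numpy as np

def second_generation_p(ci_lo, ci_hi, null_lo, null_hi):
    """Proportion of the interval estimate that overlaps the interval null,
    with a correction that caps p_delta at 1/2 for very wide (inconclusive)
    interval estimates (my reading of Blume et al. 2018)."""
    overlap = max(0.0, min(ci_hi, null_hi) - max(ci_lo, null_lo))
    ci_len, null_len = ci_hi - ci_lo, null_hi - null_lo
    return (overlap / ci_len) * max(ci_len / (2 * null_len), 1.0)

# Hypothetical example on the log hazard ratio scale, with the 'practically
# null' zone defined as HR between 1/1.1 and 1.1.
null_lo, null_hi = np.log(1 / 1.1), np.log(1.1)
print(second_generation_p(np.log(1.05), np.log(1.60), null_lo, null_hi))  # partial overlap
print(second_generation_p(np.log(1.25), np.log(1.90), null_lo, null_hi))  # no overlap -> 0
```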

Thank you @schuemie for this exciting and important work!

Here’s an idea for creating an empirically-derived reproducibility grading system for each result in LEGEND.

These quantifiable attributes of each LEGEND result affect reproducibility:

  • Effect size (RR)
  • Power (e.g. post hoc power to detect an RR of the size used to distinguish small from larger effects at alpha = .05 after FDR adjustment)
  • Measurement error (e.g. consistency of target, comparator, and outcome definitions across comparable OMOP datasets)
  • Consistency of effect replication across OMOP datasets (e.g. large high-quality DBs other than those you used: VA, Pan-European, Korean, Chinese…)

High quality published trials, as identified by the amazing @ericaVoss and co, exist for a subset of LEGEND results. This subset could be used to define a labeled training set. After deciding on a reasonable definition of successful replication, using one’s favorite regression/ML approach, one could derive optimized weights for each factor (effect size, power, etc.) as predictors of successful replication.

Those weights would form the basis of an empirically derived “reproducibility” grading system that could be applied to each LEGEND result regardless of whether relevant RCTs exist for that result. That system could be used in addition to, or instead of, MT-adjusted p values. The model and resulting grades could be updated periodically as more RCTs are published.

This seems like a potentially useful innovation in representing evidence results. If it seems worthwhile, I’d love to collaborate on it with you.
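To show the shape of the idea, here is a minimal sketch under heavy assumptions: all feature values and replication labels are made up, and plain logistic regression stands in for “one’s favorite regression/ML approach”.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 200  # hypothetical LEGEND results that have a matching RCT

# Hypothetical features per result: |log RR|, post hoc power,
# measurement-error score, and cross-database consistency.
X = np.column_stack([
    np.abs(rng.normal(0.3, 0.2, n)),   # effect size
    rng.uniform(0.2, 1.0, n),          # power
    rng.uniform(0.0, 1.0, n),          # measurement error score
    rng.uniform(0.0, 1.0, n),          # consistency across databases
])

# Hypothetical labels: did the matching RCT replicate the result (same sign)?
# Simulated with made-up weights purely so the example has some structure.
logits = X @ np.array([1.0, 1.5, -1.0, 1.0]) - 1.0
y = rng.binomial(1, 1 / (1 + np.exp(-logits)))

model = LogisticRegression().fit(X, y)
print("feature weights:", model.coef_)  # basis for the grading system

# The fitted model can then grade every LEGEND result, RCT or no RCT:
print("grade for a new result:", model.predict_proba([[0.4, 0.8, 0.7, 0.9]])[0, 1])
```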


Thanks @Andrew! I think we’ve moved from the fear of data dredging to the question of how to do signal generation appropriately.

I think orthogonal to your criteria, others were suggested by @anthonysena et al. in their poster on active surveillance:

  • Estimate: The amount of evidence suggesting a positive association. Low = 0 databases, Medium = at least 1, and High = all databases.

  • Incidence: Based on the CIOMS III working group categorization of adverse reaction frequency: Very common (≥ 10%), Common (≥ 1% and < 10%), Uncommon (≥ 0.1% and < 1%), Rare (≥ 0.01% and < 0.1%), and Very rare (< 0.01%). A small helper mapping proportions to these categories follows this list.

  • Seriousness: A relative measure of the severity of the disease based on health service utilization before and after the incident outcome.
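The CIOMS categorization in the second bullet reduces to a simple threshold mapping; a minimal helper, with boundaries copied from the bullet above:

```python
def cioms_frequency(incidence_proportion):
    """Map an incidence proportion to the CIOMS III adverse reaction
    frequency category listed in the bullet above."""
    if incidence_proportion >= 0.10:
        return "Very common"
    if incidence_proportion >= 0.01:
        return "Common"
    if incidence_proportion >= 0.001:
        return "Uncommon"
    if incidence_proportion >= 0.0001:
        return "Rare"
    return "Very rare"

print(cioms_frequency(0.003))  # 'Uncommon'
```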

I would also like to call out the poster by @pnatsiavas et al, that looks at integrating evidence from observational data with evidence from other sources, including clinical trials, social media, spontaneous reports, and scientific literature.

Thanks @Mark_Danese! The second-generation p-value is a very interesting concept, with clear advantages over the first-generation p-value. However, I’m not yet sure why they claim that

second-generation p-values provide a proper scientific adjustment for multiple comparisons

Just because they are more conservative (in most situations) doesn’t mean they work well as the number of comparisons increases. But I probably need to read the paper again.

Sorry to veer off topic. Thanks for the leads. I’ll check to see whether @ericaVoss, @anthonysena, @pnatsiavas or others are interested in pursuing this idea.

In this case, the p-value is not a probability; it is a measure of the degree of overlap with a pre-specified “unimportant” interval. At least that is what the author says here: https://www.statisticalevidence.com/second-generation-p-values (see bullet point 7). If it is interesting, I am sure they would respond to emails.

I know Jeff focuses on likelihood-based methods (between frequentist and Bayesian) in the way of Richard Royall. I am guessing that this is related to his work on likelihood ratios, which is another thing to consider: https://github.com/StatEvidence/website/blob/master/Blume2002-Tutorial.pdf

I am sorry I am not able to do a better job of summarizing all the issues, but I hope these pointers help spark some other ideas.

I agree with @Vojtech_Huser, the answer is yes and we should state the limitations. Maybe we can take a page from the YODA project, where a gatekeeper allows access once it is determined that there is an appropriate ask of the data.
