I would argue that the fact that real negative controls likely have strong unmeasured confounding makes them ideal for evaluating methods! We want to evaluate how well methods perform in the real world, not in a simulated ideal world (note that @aschuler’s approach also introduces unmeasured confounding).
One very important thing I realize we haven’t discussed: should we evaluate methods that try to quantify the risk attributable to an exposure, or methods for comparative effectiveness? In other words, methods tend to answer one of these questions:
1. What is the change in risk of outcome X due to exposure to A?
2. What is the change in risk of outcome X due to exposure to A compared to exposure to B?
Question 1 can often be answered by reformulating it as question 2 by picking a comparator believed to have no effect on the risk. For example, in our Keppra and angioedema study we picked phenytoin as a comparator because we were certain it did not cause angioedema, allowing us to estimate the effect of Keppra.
I must confess I’m mostly interested in question 1, since comparative effectiveness methods can be viewed as answering question 1 by picking a ‘null comparator’ as argued above. But we could create two gold standards, one for question 1 methods and one for question 2 methods.
@aschuler, there is at least one thing we can do to evaluate unmeasured confounding: we can compare an evaluation using true negative controls to an evaluation using your simulation framework where the relative risk is 1 (no effect). If the simulation procedure is realistic enough, those two evaluations should generate the same results.
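To make that comparison concrete, here is a rough sketch (Python, with made-up placeholder numbers rather than anyone’s actual pipeline; the function and variable names are mine): compute the same null-control summary once on the real negative controls and once on simulated controls with a true relative risk of 1, and check whether the two summaries agree.

```python
# Hypothetical sketch: the same summary computed on real negative controls and on
# simulated controls with a true relative risk of 1. If the simulation framework is
# realistic, the two summaries should come out roughly the same.
import numpy as np
from scipy import stats

def summarize_null_controls(log_rr, se_log_rr, alpha=0.05):
    """Summarize a method's estimates on controls whose true RR is 1."""
    log_rr, se_log_rr = np.asarray(log_rr), np.asarray(se_log_rr)
    z = stats.norm.ppf(1 - alpha / 2)
    excludes_one = (log_rr - z * se_log_rr > 0) | (log_rr + z * se_log_rr < 0)
    return {
        "type 1 error": np.mean(excludes_one),      # how often the CI excludes RR = 1
        "mean log RR": np.mean(log_rr),             # average systematic error
        "SD of log RR": np.std(log_rr, ddof=1),
    }

# Placeholder numbers standing in for estimates produced by the method under evaluation.
rng = np.random.default_rng(0)
se = rng.uniform(0.1, 0.4, size=50)
real = summarize_null_controls(rng.normal(0.15, 0.25, 50) + rng.normal(0, se), se)
simulated = summarize_null_controls(rng.normal(0.15, 0.25, 50) + rng.normal(0, se), se)
print(real)
print(simulated)
```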
Based on @saradempster’s suggestion I’ve created a template protocol for establishing the benchmark. I hope everyone will join in filling in this protocol!
You can find the link to the protocol template in this topic.
Just thinking further about metrics for assessing CIs.
If we are really interested in effect estimation, then we want confidence intervals w.r.t. the true value (see the sketch after this list):
- coverage
- mean CI width
- variance of CI width
- bias (point estimate or CI midpoint versus true value)
- see the Kang and Schmeiser CI scatterplots [1] (e.g., CI half-width versus midpoint)
  - (they are much like Martijn’s scatter plots)
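As a straw man, a minimal Python sketch of these CI metrics, assuming we have, for each control, a 95% CI and the true effect size on the log scale; the function name and toy numbers are just for illustration.

```python
# Sketch of the estimation-oriented CI metrics listed above, on the log scale.
import numpy as np

def ci_metrics(lb, ub, truth):
    lb, ub, truth = map(np.asarray, (lb, ub, truth))
    width = ub - lb
    midpoint = (lb + ub) / 2
    return {
        "coverage": np.mean((lb <= truth) & (truth <= ub)),
        "mean CI width": np.mean(width),
        "variance of CI width": np.var(width, ddof=1),
        "bias (midpoint - truth)": np.mean(midpoint - truth),
        # Kang & Schmeiser [1] plot half-width against midpoint; return the pairs
        # so they can be scatter-plotted.
        "half-width vs midpoint": list(zip(width / 2, midpoint)),
    }

# Toy example: three controls whose true log RR is log(2).
print(ci_metrics(lb=[0.3, 0.5, 0.8], ub=[1.1, 1.3, 1.5], truth=[np.log(2)] * 3))
```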
If we want to discover associations, then we want confidence intervals w.r.t. no effect (1), and the true value is irrelevant other than its direction (see the sketch after this list):
- this is really just a hypothesis test (p-value)
- specificity is set at .95 (95% coverage of negative controls after calibration)
- sensitivity is the proportion of positive controls whose CI excludes no effect (1)
- we can derive the relation of sensitivity to the CI: a CI centered on the true effect size excludes 1 when (CI width / 2) < effect size - 1
- ROC area calculated based on the point estimates of specificity and sensitivity
  - (or perhaps we could generate a curve by varying alpha: .2, .1, .05, .03, .01)
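And a similar rough sketch for the discovery-oriented metrics, assuming two-sided p-values against the null of no effect (RR = 1) for the negative and positive controls; the alphas are the ones listed above and the p-values are made up.

```python
# Sketch: sensitivity/specificity at a fixed alpha, plus ROC points from varying alpha.
import numpy as np

def sens_spec(p_negatives, p_positives, alpha=0.05):
    p_neg, p_pos = np.asarray(p_negatives), np.asarray(p_positives)
    specificity = np.mean(p_neg >= alpha)   # negative controls not flagged
    sensitivity = np.mean(p_pos < alpha)    # positive controls flagged
    return specificity, sensitivity

def roc_points(p_negatives, p_positives, alphas=(0.2, 0.1, 0.05, 0.03, 0.01)):
    # Trace the curve by varying alpha, as suggested above: (false positive rate, sensitivity).
    return [(1 - sens_spec(p_negatives, p_positives, a)[0],
             sens_spec(p_negatives, p_positives, a)[1])
            for a in sorted(alphas)]

# Toy example with made-up p-values.
p_neg = [0.8, 0.4, 0.03, 0.6, 0.2]
p_pos = [0.001, 0.04, 0.2, 0.0003]
print(sens_spec(p_neg, p_pos))
print(roc_points(p_neg, p_pos))
```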
Just noticing that when we do p-value calibration and report coverage, we really should also report power on positive controls.
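A very simplified sketch of what reporting that power could look like: fit the systematic error distribution on the negative controls (here crudely, as the mean and SD of their log RR estimates, rather than a proper maximum-likelihood fit that accounts for each estimate’s standard error), recompute p-values for the positive controls, and report the fraction below alpha. All names and numbers below are placeholders.

```python
# Simplified calibration-and-power sketch; not the full calibration procedure.
import numpy as np
from scipy import stats

def calibrated_power(log_rr_neg, log_rr_pos, se_pos, alpha=0.05):
    # Crude systematic error model from the negative controls.
    mu, sigma = np.mean(log_rr_neg), np.std(log_rr_neg, ddof=1)
    se_pos = np.asarray(se_pos)
    z = np.abs(np.asarray(log_rr_pos) - mu) / np.sqrt(sigma**2 + se_pos**2)
    p_calibrated = 2 * (1 - stats.norm.cdf(z))
    return np.mean(p_calibrated < alpha)    # power on the positive controls

# Toy numbers only.
print(calibrated_power(log_rr_neg=[0.1, -0.05, 0.2, 0.0, 0.15],
                       log_rr_pos=[0.9, 0.7, 1.2],
                       se_pos=[0.2, 0.25, 0.3]))
```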
[1] Kang K, Schmeiser B (1990). Graphical Methods for Evaluating and Comparing Confidence-Interval Procedures. Operations Research 38(3):546-553. http://dx.doi.org/10.1287/opre.38.3.546