Concern about the construction of the positive controls in the empirical CI calibration paper

rosa.gini · May 28, 2018, 3:55pm

dear @schuemie , @Patrick_Ryan , and all,

i have been reading thoroughly the paper empirical CI calibration by martijn et al, and i have a concern that i hope you can address: it looks to me as if the assumption behind the construction of positive controls is equivalent to assuming that the outcome variable is measured without error.

this is the statement that puzzles me.

“We exploit the negative controls to construct synthetic positive controls by injecting simulated outcomes during exposure. For example, assume that, during exposure to dabigatran, n occurrences of ingrowing nail were observed. If we now add an additional n simulated occurrences during exposure, we have doubled the risk. Since this was a negative control, the relative risk compared with warfarin was one, but after injection, it becomes two.”

to my understanding, the idea here is that since n cases are observed in exposed, and the risk in exposed and non exposed should be the same (because it’s a negative control), then injecting n simulated occurrences during exposure should produce a new ‘true’ variable (half original and half simulated) which has a double risk among exposed wrt the negative control, and therefore a ‘true’ RR=2.

let me start from a simple example of a negative control where this seems not to be working: presbyopia. in this case, almost all of the study population (warfarin or dabigatran users) is older than 45 and we can safely assume that they almost all have presbyopia, regardless of exposure. however, it is well possible that only n people exposed to dabigatran have a record of presbyopia (n smaller than the people exposed), since it is a condition that in itself requires mild or no medical attention - for instance you can safely bet that this is what happens in my own database. as a consequence, when we add additional n simulated occurrences during exposure to dabigatran, we pick no new ‘true’ cases! so the ‘true’ risk of the positive control among the exposed remains the same as the risk of the negative control, therefore the ‘true’ RR of the positive control is still 1.

let’s now go to a less extreme negative control: obesity. even though, in this case, not all the exposed to dabigatran are obese, obesity is a highly prevalent condition among adults - safely more than 10%. at the same time, the outcome may well have imperfect sensitivity: for instance, i am confident that the vast majority of obese people cannot be retrieved as such from my own database. so obesity is a case with high prevalence and (perhaps, only in some databases) low sensitivity. some computations then show that even in this case, when injecting n occurrences among the exposed, the RR increases by less than a factor 2. the reasons are two. first, when we select randomly the new cases, we must expect to pick some obese that we had not noticed, so we are in fact adding less than n cases; second, the true risk of obesity among exposed (that balances against the risk among unexposed) is not the observed risk, but the true risk. so, if we add a number of occurrences similar to the number of observed obese, we increase the risk proportional to the observed risk of obesity, not proportional to the true risk of obesity: as a consequence, we increase the risk by a factor which is discounted by the proportion of obese that we don’t observe.

let me do the math. the n cases of obesity observed in the exposed are a small percentage of the true cases of obesity: say that there is a number k bigger than 1 such that the true number of exposed obese is kn. therefore among the exposed there are (k-1)n true obese people that did contribute to the true risk, but that could not be observed from the data. now we inject n additional simulated occurrences, picked randomly among the N-n that we observed as non obese, N being the total number of persons in the exposed study population. the two problems mentioned above can be quantified as follows, where we denote by p the prevalence of obesity, that is, p=nk/N

problem 1) we must expect to pick randomly some of the (k-1)n true obese that we had not noticed. on average, the proportion of true obese among the n injected is (k-1)n/(N-n), or (after some computations) (p-p/k)/(1-p/k). therefore the number of new cases among exposed in the positive controls is not n but n(1-(p-p/k)/(1-p/k))=n(1-p)/(1-p/k). in short:

true new cases= n(1-p)/(1-p/k)

if p is 10% and k is 4 (so, a sensitivity of 25%), this factor is .92.
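
a minimal Python sketch of the formula above, to make the arithmetic reproducible. the function name is an assumption made for illustration; p and k are as defined in the text (p = kn/N, k = 1/sensitivity).

```python
def true_new_case_fraction(p: float, k: float) -> float:
    """Expected fraction of the n injected cases that are truly new.

    p: true prevalence of the outcome among the exposed (p = k*n/N)
    k: inverse of sensitivity (k = 1/sensitivity), so k >= 1
    """
    if not 0 <= p < 1:
        raise ValueError("p must be in [0, 1)")
    if k < 1:
        raise ValueError("k must be >= 1")
    # (1 - p) / (1 - p/k), as derived in the text
    return (1 - p) / (1 - p / k)


# Example from the text: prevalence 10%, sensitivity 25% (k = 4)
print(round(true_new_case_fraction(0.10, 4), 2))  # 0.92
```

with perfect sensitivity (k = 1) the fraction is exactly 1, that is, every injected case is new.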

problem 2) the true cases of obesity in the exposed are kn, so the relative risk in the exposed of the positive control wrt the negative control is

(kn+(true new cases))/kn=(kn+n(1-p)/(1-p/k))/kn=1+(1/k)(1-p)/(1-p/k)

if p is 10% and k is 4, this is 1.23: the risk of the positive control in the exposed is therefore much smaller than 2.
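
the same computation as a small Python sketch (the function name is hypothetical; the formula is the one written above):

```python
def true_rr_low_sensitivity(p: float, k: float) -> float:
    """'True' risk ratio of positive vs negative control in the exposed
    after injecting n cases, when sensitivity is 1/k and PPV is 100%.

    p: true prevalence among the exposed, k: inverse of sensitivity.
    """
    if not 0 <= p < 1 or k < 1:
        raise ValueError("need 0 <= p < 1 and k >= 1")
    return 1 + (1 / k) * (1 - p) / (1 - p / k)


print(round(true_rr_low_sensitivity(0.10, 4), 2))  # 1.23
print(round(true_rr_low_sensitivity(0.00, 4), 2))  # 1.25
```

the second call sets p to 0 and shows that the value stays well below 2 even when the outcome is rare.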

and finally, notice that in the previous computations we have made no use of the fact that the prevalence of obesity is high. if prevalence is low, the first problem is in fact not relevant, but unfortunately the second problem is only dependent on sensitivity: if p is close to 0, the formula is approximately 1+1/k. in summary: when sensitivity is low, the positive control built by doubling the observed cases has a smaller RR than 2.

and a similar problem, in the opposite direction, could be argued if PPV is low: then the positive control has a higher risk than 2.

so, if i am right, the assumption in the sentence at the beginning of this message is equivalent to PPV=sensitivity=100%.

but, as i said, i would be happy to be proven wrong!

looking forward to the authors’ and others’ feedback!

rosa

schuemie · May 29, 2018, 6:39am

In population-level estimation, systematic error can occur for roughly two reasons:

1. Differences between the exposed and non-exposed (or target and comparator) that are not due to the exposure. For example, the people that get the exposure might already be sicker than those that do not, and for that reason alone have the outcome more often. These differences can also include differences in the likelihood that a true outcome is recorded in the data (detection bias).
2. Non-differential error in the detection of the outcome. For example, if the positive predictive value is 50% everywhere, then half of the outcomes we consider in our computation are not really the outcome, and we may be biased towards the null.

Negative controls are useful because they can help detect the first type of error, including detection bias. For example, if in the exposed group the sensitivity is 100%, but in the unexposed group it is 50%, the negative controls will show a relative risk of 2 even if there’s no true effect.
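
The detection-bias example can be checked with a few lines of Python. This is only an illustrative sketch: the function name and the 2% true risk are assumptions chosen for the example, and it assumes the observed risk is simply the true risk multiplied by the sensitivity.

```python
def observed_rr(true_risk_exposed: float, true_risk_unexposed: float,
                sens_exposed: float, sens_unexposed: float) -> float:
    """Observed relative risk when only a fraction (sensitivity) of the
    true outcomes is recorded in each group."""
    observed_exposed = true_risk_exposed * sens_exposed
    observed_unexposed = true_risk_unexposed * sens_unexposed
    return observed_exposed / observed_unexposed


# Negative control: identical true risk (2% is an arbitrary illustrative value),
# but sensitivity is 100% in the exposed and 50% in the unexposed.
print(observed_rr(0.02, 0.02, 1.0, 0.5))  # 2.0
```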

Negative controls do not inform us on what happens when the null is not true, when the true relative risk is greater or smaller than 1. This is why we introduced synthetic positive controls, where this first type of error is preserved as much as possible. We can then evaluate a method on these positive controls, and for example see whether there is immortal time bias, shrinkage towards the null, etc.

@rosa.gini’s confusion seems to stem from the fact that positive controls do not address the second type of error. To quantify this type of error, we desperately need the work headed by others as discussed here, because neither negative controls nor synthetic positive controls help here.

However, her math is misleading. The positive control synthesis does not require the assumption that sensitivity is 100%, just that it is the same for real and synthetic outcomes. I will try to explain in this example:

Imagine we know everything, that there are 10 true outcomes in the exposed group, 10 true outcomes in the unexposed group, and that the sensitivity is 80%, so we observe only 8 outcomes in each group. If we want to double the risk in the exposed, we add 10 more ‘true’ outcomes, and because the sensitivity is 80% we only observe 8 additional outcomes, so 16 in total in the exposed group. The true RR is 20/10 = 2. The observed RR is 16/8 = 2.

Now imagine we don’t know everything. All we see is 8 outcomes in each group. As our method prescribes, we double the number of observed outcomes in the exposed group, so we add 8 to get 16. The observed RR is 16/8 = 2. We don’t know the true RR, but as explained above, if we did know everything we’d know it is 2.
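
The two scenarios above can be written out as a small Python sketch. It assumes, as the example implicitly does, that both groups have the same size (so the RR is a ratio of counts) and that synthetic outcomes are observed with the same 80% sensitivity as real ones. The helper name `rr` is made up for this illustration.

```python
from fractions import Fraction


def rr(exposed_count, unexposed_count) -> Fraction:
    """Relative risk for two groups of equal size (an assumption of this sketch)."""
    return Fraction(exposed_count) / Fraction(unexposed_count)


sensitivity = Fraction(8, 10)
true_exposed, true_unexposed = 10, 10

# "We know everything": add 10 true outcomes, of which 80% are observed.
true_rr = rr(true_exposed + 10, true_unexposed)
observed_rr_full = rr((true_exposed + 10) * sensitivity,
                      true_unexposed * sensitivity)

# "We don't know everything": we only see 8 outcomes and double them.
observed = true_exposed * sensitivity
observed_rr_injected = rr(observed + observed, true_unexposed * sensitivity)

print(true_rr, observed_rr_full, observed_rr_injected)  # 2 2 2
```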

For those who care about the nitty gritty details: Rosa’s math assumes it matters that some of the people we inject outcomes in may already have the outcome in real life, even though we didn’t detect it because our sensitivity is not 100%. In that sense we are not really adding outcomes, and are not increasing the true RR as much as we think. The probability of this happening is small: most outcomes we study have a prevalence of less than 1%, and so the effect, if it really was a problem, would be less than 1%. But it is not a problem at all; we are trying to estimate the first type of error, and the collision of hypothetical true but unobserved events is not relevant to this problem. As a thought experiment, if we simply allow for people to have two outcomes, one unobserved (due to the background rate) and one observed (synthetic, simulated to be due to the exposure), the math works out, and Rosa’s argument falls apart. Finally, as is tradition in OHDSI, empirical evidence should have the last word: the fact that on many occasions we have found study designs to be virtually unbiased both for negative and synthetic positive controls (e.g. the Graham replication in our paper) shows that at least in those studies this problem did not exist.

rosa.gini · May 30, 2018, 8:14am

thanks martijn for clarifying that the paper makes an assumption on positive controls: that the injected cases have the same sensitivity as the corresponding negative control. however, the assumption sounds somehow arbitrary to me: why wouldn’t someone else assume that sensitivity is double? or half? or 100%? this assumption propagates directly to the assumption about the ‘truth’ of the RR, which is the critical point of the paper: so knowing that this critical assumption is necessary does not really address my concern. but maybe there is something obvious that i am not seeing?

it would look safer to me to drop this assumption, and rather include an estimate of sensitivity in the computations. i agree with martijn that the estimate could be developed as suggested in this thread. if the estimate is only approximated, the impact on the results of a set of scenarios for sensitivity could be explored.

on the other source of variability of the ‘true’ RR: martijn and i agree that, no matter what k, the difference between n(1-p)/(1-p/k) and n is small if the prevalence p of the outcome is small. i am more cautious in claiming that all the outcomes (beyond presbyopia) have low prevalence, though, because what we are interested in here is the prevalence among exposed, not in the general population. in the specific case, there are several negative controls which are related to diabetes, which is highly prevalent among dabigatran users, and p is likely to be much higher here than in the general population. so this is something that could be usefully included in the pipeline for generating the positive outcome: indeed if scenarios on sensitivity are tested, then p can be estimated by kn/N.

as for the fact that non-differential PPV is not of interest: i am not sure i understand why it is not. indeed, it is a classical result that non-differential PPV (which, in itself, does not bias the point estimate of the RR, so, does not induce systematic error) does have an impact on the variance, which is precisely what this paper is addressing. but maybe, also here, there is something obvious that i am not seeing.

i have the feeling this can be replicated in case there are also false positives, that is, PPV is <100%. i will develop the computations and share them here - or, if someone else has already made the computations or seen them in the literature, that would be great. i would also be happy to read any additional input/comment!

Christian_Reich · May 30, 2018, 1:04pm

@rosa.gini:

I think you have a point. Doubling the observed (with low sensitivity or not) will double the RR only if the prevalence is small. If not, we start eating away the denominator. In fact, such doubling could exhaust all patients in the population. Obviously, in this extreme case of your calculation, @schuemie’s assumption of a good positive control will fail.

There are other artifacts that can happen. These controls should measure the ability of the method to control for differential error (#1 in @schuemie above). How does the method do that? It uses other data to measure this bias, and then adjusts for it. If we now add “surgically clean” outcomes, which are not susceptible to all sorts of shenanigans in the data, the method may get tripped and now overcorrects (assuming the clean ones are also biased) or undercorrects (because it cannot estimate the correct bias anymore, or like in your low sensitivity obesity situation). Who knows.

It is possible that these effects cancel each other out and you get a perfect RR=2, even though the method is completely off. The likelihood of that is very, very low, though.

Plus: We are not proposing to use one control, but many. And it is practically impossible that all these effects conspire against us in all the tests: frequent and rare ones, with high and low sensitivity, heavily biased or not by one thing or another.

schuemie · May 30, 2018, 2:36pm

@Christian_Reich: Just to be clear: if over half the persons in the exposed group have the outcome (in the data), and you try to double it, of course the code throws an error because you’re trying to do something that is not possible.
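
As an illustration of that kind of guard, here is a hypothetical Python sketch. It is not the actual OHDSI code; the function name, arguments, and error message are all assumptions. It only shows the feasibility check: you cannot inject more outcomes than there are exposed persons without the outcome.

```python
def injected_count_for_target(n_observed: int, n_exposed: int,
                              target_rr: float) -> int:
    """Number of outcomes to inject so the observed count in the exposed
    goes from n_observed to about target_rr * n_observed.

    Hypothetical helper: assumes at most one injected outcome per person
    and that only persons without an observed outcome can receive one.
    """
    to_inject = round((target_rr - 1) * n_observed)
    available = n_exposed - n_observed
    if to_inject > available:
        raise ValueError(
            f"cannot inject {to_inject} outcomes: only {available} "
            "exposed persons have no observed outcome"
        )
    return to_inject


print(injected_count_for_target(8, 100, 2))  # 8
# injected_count_for_target(60, 100, 2) raises ValueError (60 > 40 available)
```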

Christian_Reich · May 31, 2018, 11:03am

@schuemie: Would it also make sense to throw an error even when the injection uses less than the entire population, but gets close enough to noticeably change the RR? That would help take care of one of @rosa.gini’s points.

With respect to the other points and how to pick good controls: @ericaVoss, @rkboyce and I are working on a cookbook of how to pick, what to watch out for and what categories there are. We will add both the low sensitivity issue and the high background prevalence issue.

schuemie · May 31, 2018, 11:19am

@Christian_Reich, the injection function has a precision parameter that specifies the allowed divergence between actual injected size and target effect size. The default value is 1%, so if the injected size is more than 1% different from the desired effect size the program throws an error.
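
A rough Python sketch of what such a precision check could look like. This is not the package's implementation: the function name is made up, and measuring the divergence as a relative difference between achieved and target RR is an assumption of this sketch.

```python
def check_injection_precision(n_observed: int, n_injected: int,
                              target_rr: float,
                              precision: float = 0.01) -> float:
    """Return the achieved effect size, or raise if it diverges from the
    target by more than `precision` (relative difference, an assumption)."""
    achieved_rr = (n_observed + n_injected) / n_observed
    divergence = abs(achieved_rr - target_rr) / target_rr
    if divergence > precision:
        raise ValueError(
            f"achieved effect size {achieved_rr:.3f} differs from target "
            f"{target_rr} by more than {precision:.0%}"
        )
    return achieved_rr


print(check_injection_precision(8, 8, 2.0))  # 2.0
# With only 3 observed outcomes a target of 1.5 cannot be hit closely:
# check_injection_precision(3, 1, 1.5) raises ValueError (achieved 1.333)
```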

rosa.gini · June 8, 2018, 10:28am

i have completed the math for the case when PPV<100% and for the case when both PPV<100% and sensitivity<100%. find at the bottom of this post the detailed proof of the general case (which contains the case sensitivity 100% as a special case). a summary follows

===========================
-) case when PPV<100%,

if the number n of observed cases among exposed is doubled, but some of them were not true cases in the first place, we more-than-double the true cases, hence the true risk in the exposed is inflated more than expected. in formulas: let’s call v the PPV of the outcome, that is, the proportion of true positives among the n observed cases, then the number of true cases of the negative control in the exposed is vn. so the relative risk in the exposed of the positive control (when n new cases are injected) wrt the negative control is

(vn+(true new cases))/vn=(vn+n)/vn=1+1/v

since v<1, 1/v is bigger than 1, and therefore 1+1/v>2.
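
a quick Python check of this case (the function name is an assumption for illustration; v is the PPV as above, and sensitivity is taken to be 100%):

```python
def true_rr_low_ppv(v: float) -> float:
    """'True' risk ratio in the exposed after injecting n cases when only
    a fraction v (the PPV) of the n observed cases are true cases."""
    if not 0 < v <= 1:
        raise ValueError("v must be in (0, 1]")
    # (v*n + n) / (v*n) = 1 + 1/v
    return 1 + 1 / v


print(true_rr_low_ppv(1.0))  # 2.0
print(true_rr_low_ppv(0.5))  # 3.0
```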

-) general case: PPV<100%, sensitivity<100%

now, the general case: v<1 and k>1, k being the inverse of sensitivity (k=1/sensitivity) and v being the PPV. the number of true cases of the negative control among the exposed is then kvn, among which (k-1)vn unobserved. the true prevalence of negative controls among the N exposed study participants is kvn/N, let’s call it p. the proportion of true unobserved negative controls among the n injected is (k-1)vn/(N-n), which after some manipulation becomes (p-p/k)/(1-p/(kv)). the number of truly new subjects injected by injecting n subjects is therefore n(1-p)/(1-p/(kv)).

so the relative risk in the exposed of the positive control (when n additional cases are injected) wrt the negative control is

(kvn+(true new cases))/kvn=(kvn+n(1-p)/(1-p/(kv)))/kvn=1+(1/(kv))(1-p)/(1-p/(kv))

=================================

therefore injecting n cases increases the risk in the exposed by a factor of 1+(1/(kv))(1-p)/(1-p/(kv)), which is 2 only if kv=1, that is, sensitivity is equal to PPV (in particular, if they are both 100%). some typical situations are represented in this graph: three values of p (1%, 5% and 10%), three values of PPV (50%, 75% and 100%), and the RR is represented as a function of sensitivity.

RR.pdf (60.9 KB)

the formula takes values that range from 1.5 to 3 for combinations of PPV and sensitivity that may easily occur, like low sensitivity and high PPV. this is specifically the case of European databases, where sensitivity of the same condition may vary a lot between databases, and sensitivity of a database may vary a lot between conditions, as shown in this poster presented in rotterdam.
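
for readers who want to explore scenarios themselves, here is a minimal Python sketch that evaluates the formula as stated in this post over a small grid. the function name and the grid of sensitivity values are assumptions made for illustration; the values of p and PPV are the ones used in the graph.

```python
def true_rr_after_injection(p: float, sensitivity: float, ppv: float) -> float:
    """Formula from the post: 1 + (1/(kv)) * (1-p) / (1 - p/(kv)),
    with k = 1/sensitivity and v = PPV, so kv = ppv / sensitivity."""
    if not (0 < sensitivity <= 1 and 0 < ppv <= 1 and 0 <= p < 1):
        raise ValueError("p in [0, 1), sensitivity and ppv in (0, 1]")
    kv = ppv / sensitivity
    if p >= kv:
        # p/(kv) equals n/N, the observed proportion, which must be below 1
        raise ValueError("inconsistent inputs: p/(kv) must be below 1")
    return 1 + (1 / kv) * (1 - p) / (1 - p / kv)


sensitivities = (0.25, 0.50, 0.75, 1.00)  # illustrative grid (assumption)
for p in (0.01, 0.05, 0.10):
    for ppv in (0.50, 0.75, 1.00):
        row = [round(true_rr_after_injection(p, s, ppv), 2) for s in sensitivities]
        print(f"p={p:.2f} PPV={ppv:.2f} RR by sensitivity {sensitivities}: {row}")
```

whenever sensitivity equals PPV the function returns 2, and with PPV at 100% it reduces to the sensitivity-only formula from my first post.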

so my conclusion would be that replacing the assumption that RR is 2 with the formula above could be a way forward. this would require that reasonable scenarios for sensitivity or prevalence, or for sensitivity and PPV, or for PPV and prevalence are available - two of the three are enough, as discussed in this thread - for each negative control. providing such scenarios should be part of the development of the set of negative controls, and should be performed per database - although some general principles may be validated and later applied at scale, and i think the methodology in the poster may be of help in that.

what do people think? @Christian_Reich: i didn’t understand what conspiracy you were picturing in your post (although i enjoyed the image a lot), does this discussion support your conclusion?

computation.pdf (138.6 KB)