@Christian_Reich, these are valid scenarios that regularly occur and for which we’d like to have better methods, and they present good opportunities for further research.
To provide my intuition on why a comparative cohort design works well in certain circumstances as a proxy for what we ‘really’ want:
We are attempting to model a hypothetical counterfactual scenario: a person faces a decision between two alternatives (the Matrix’s ‘red pill’ or ‘blue pill’). In reality, once a decision is made (Neo picks the ‘red pill’), you observe the effects of that decision. You’d ideally like to contrast that with the effects that would have been observed had you made the alternative decision (e.g. what would have happened had Neo picked the ‘blue pill’ instead?). If we had Doc Brown’s DeLorean, we’d go back in time to the split-second before the decision of the red pill, swap it for the blue pill, then observe what happened to Neo. The difference between Neo’s experience on the red pill and the blue pill would be the ‘individual causal effect’ that we are aiming to estimate. If you could find a bunch of Neos and, for each person, go back in the time machine to replay the decision, you could use the collective set of experiences to produce an ‘average treatment effect’, which would be the expected value of the individual causal effects.
Since we don’t have a time machine (yet), our first best approximation is a randomized trial, where we take a cohort of persons facing the same decision and randomize them into two cohorts: the target cohort of persons assigned to take the red pill, and the comparator cohort of persons assigned to take the blue pill. We estimate an ‘average treatment effect’ as the difference in outcomes between the target cohort and comparator cohort. Note, you no longer have an ‘individual causal effect’, because you don’t observe the effect of both the red pill and the blue pill in the same person, but instead make an exchangeability assumption that the people assigned to the red pill are sufficiently similar to the people assigned to the blue pill. This exchangeability assumption is often empirically assessed by comparing the baseline characteristics of the target cohort and comparator cohort to ensure that covariate balance is achieved.
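To make the ‘time machine vs. randomized trial’ distinction concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration (the function names, the made-up outcome model, and the true effect of roughly 2.0); it only shows that the trial recovers the average treatment effect without ever observing an individual causal effect.

```python
import random


def simulate_people(n=20000, seed=0):
    """Simulate both potential outcomes per person (hypothetical outcome model)."""
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        baseline = rng.gauss(0.0, 1.0)            # one baseline characteristic
        y_blue = baseline + rng.gauss(0.0, 1.0)   # outcome had they taken the blue pill
        y_red = y_blue + 2.0 + 0.5 * baseline     # outcome had they taken the red pill
        people.append((baseline, y_red, y_blue))
    return people


def time_machine_ate(people):
    """Average of individual causal effects; needs BOTH outcomes for every person."""
    return sum(y_red - y_blue for _, y_red, y_blue in people) / len(people)


def randomized_trial_estimate(people, seed=1):
    """Each person reveals only one outcome, chosen by a coin flip."""
    rng = random.Random(seed)
    red, blue = [], []
    for _, y_red, y_blue in people:
        if rng.random() < 0.5:
            red.append(y_red)     # target cohort: only the red-pill outcome is observed
        else:
            blue.append(y_blue)   # comparator cohort: only the blue-pill outcome is observed
    return sum(red) / len(red) - sum(blue) / len(blue)


people = simulate_people()
ate = time_machine_ate(people)
trial = randomized_trial_estimate(people)
print(f"time-machine ATE: {ate:.3f}, randomized trial estimate: {trial:.3f}")
```

The two numbers agree closely only because randomization makes the cohorts exchangeable; if assignment depended on `baseline`, the simple difference in means would generally be biased.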
Why am I being pedantic about randomized trials when you are asking about how to conduct an observational study? Because, before you can consider how good an observational study could be, it is often helpful to start by asking, ‘what would the ideal randomized trial look like to answer this question?’. At the OHDSI Symposium in the plenary talks on LEGEND from me, @schuemie and @msuchard, we used this approach as a useful construct, as advocated in Miguel Hernan and Jamie Robins’ perspective piece in AJE.
So, in the scenario you lay out: you have an established intervention for a given indication and a new intervention for the same indication which is generally only used when the established intervention has been proven ineffective. What RCTs could we imagine?
Amongst the cohort of persons with the indication, randomize treatment-naive persons to either the established intervention or the new intervention. This will give you an estimate of the average treatment effect of established vs. new intervention amongst persons with the indication (but will not tell you about the effects of the new intervention amongst established failures).
Amongst the cohort of persons with the indication who have ‘failed’ established intervention, randomize persons to either new intervention or placebo. This will give you an estimate of average treatment effect of the new intervention vs. placebo amongst established failures (but will not tell you about the effects of new intervention vs. established intervention).
Note, these are two different causal contrasts. Neither is right or wrong; they are just complementary questions. And this introduces something for you to consider: what causal question are you actually trying to answer? Do you want to know about the absolute effects of the new intervention (vs. not exposed) or the comparative effects of the new intervention relative to the established intervention? And to which population do you want the estimated causal effect to apply: treatment-naive persons with the indication, or persons with the indication who have previously failed treatment?
Only after you define the causal contrast that you are seeking to estimate should we start thinking about the study design to produce the estimate.
Now, since we aren’t going to prospectively conduct an RCT, one tack to take is to try to closely approximate the ideal RCT design with a retrospective observational database design.
Take those two RCTs above, and look at how the new user comparative cohort design could support them:
- RCT: Amongst the cohort of persons with the indication, randomize treatment-naive persons to either the established intervention or the new intervention.
Observational study analog: Amongst persons with the indication who were previously treatment-naive, identify a target cohort who chose the established intervention and a comparator cohort who chose the new intervention. Our index date: the ‘decision’ date of choosing established or new intervention.
- RCT: Amongst the cohort of persons with the indication who have ‘failed’ established intervention, randomize persons to either new intervention or placebo (or whatever alternative ‘standard of care’ would be offered following established treatment failure).
Observational study analog: Amongst persons with the indication who were exposed to the established intervention and were observed to experience ‘treatment failure’ and stop this treatment, identify a target cohort who chose to initiate treatment with the new intervention, and a comparator cohort who chose to only stop the established intervention without starting the new intervention (if some alternative ‘standard of care’ was available, this could serve as the comparator). Our index date: the ‘decision’ date of failing the established intervention and choosing whether to initiate new intervention.
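As a rough sketch of the first analog (treatment-naive new users), the cohort assignment and index date could be expressed as below. The `Person` record, its field names, and `assign_new_user_cohort` are hypothetical and invented for this illustration; they are not the OMOP CDM schema or any OHDSI package API. The sketch also requires the index date to fall in the period when both treatments were on the market, so that a real decision existed.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass
class Person:
    person_id: int
    indication_date: date              # first record of the indication
    first_established: Optional[date]  # first exposure to the established intervention
    first_new: Optional[date]          # first exposure to the new intervention


def assign_new_user_cohort(
    p: Person, both_available_from: date
) -> Optional[Tuple[str, date]]:
    """Return (cohort, index_date) for a treatment-naive new user, else None."""
    starts = [d for d in (p.first_established, p.first_new) if d is not None]
    if not starts:
        return None                    # never exposed to either treatment
    if p.first_established == p.first_new:
        return None                    # started both on the same day: ambiguous decision
    index_date = min(starts)           # the 'decision' date: first ever exposure
    if index_date < p.indication_date:
        return None                    # treated before the indication was recorded
    if index_date < both_available_from:
        return None                    # the alternative was not yet a real option
    cohort = "established" if index_date == p.first_established else "new"
    return cohort, index_date


launch = date(2016, 1, 1)  # assumed date from which both treatments were available
print(assign_new_user_cohort(Person(1, date(2015, 1, 1), date(2016, 3, 1), None), launch))
print(assign_new_user_cohort(Person(2, date(2015, 1, 1), date(2015, 6, 1), None), launch))
```

The second person is excluded because they started the established intervention before the new one was available. The second analog (post-failure) would need a different index date, namely the observed treatment-failure date, which this sketch does not attempt.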
In both cases, the data should be limited to the time period where the alternative treatments were both available to a person, so that a legitimate decision had to have been made. And in both cases, we can use all information on or before this index date (e.g. baseline covariates) to evaluate whether the target and comparator cohorts are comparable enough to not explicitly violate the exchangeability assumption. And if the two original cohorts are not sufficiently comparable, we have statistical adjustment techniques (such as propensity score matching/stratification) that we can apply to make new target/comparator cohorts which can also be empirically evaluated for baseline covariate balance.
Some important diagnostics to evaluate BEFORE YOU UNBLIND YOURSELF TO THE OUTCOME are: 1) are the target and comparator cohorts sufficiently near clinical equipoise to be a good enough proxy for the counterfactual ideal? and 2) have you achieved adequate covariate balance across all baseline characteristics? This is also the point in the process where we’d advocate for using negative controls as a powerful diagnostic instrument to attempt to observe other sources of systematic error, potentially due to confounding, selection bias, or measurement error. (Here’s an example where the combination of covariate balance diagnostics and negative controls helped to identify channeling bias that is persistent in many published studies).
At this point, you can also evaluate whether there is sufficient follow-up time available in the database (e.g. is the desired ‘time-at-risk’ observable in both cohorts, with an appropriate distribution?), and you can use this information to assess statistical power (e.g. do you have a sufficient number of exposed persons with adequate follow-up time to observe enough outcomes to detect a causal effect if it does indeed exist).
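One common way to put a number on covariate balance is the standardized mean difference (SMD) per baseline covariate. The sketch below is illustrative only: the function names are made up, the data are toy values, and the 0.1 cutoff is a widely used rule of thumb that I am assuming here, not something fixed by this discussion.

```python
import math
import statistics


def standardized_mean_difference(target, comparator):
    """Difference in means divided by the pooled standard deviation."""
    mean_diff = statistics.mean(target) - statistics.mean(comparator)
    pooled_sd = math.sqrt(
        (statistics.variance(target) + statistics.variance(comparator)) / 2
    )
    if pooled_sd == 0:
        # No spread in either cohort: balanced only if the means coincide.
        return 0.0 if mean_diff == 0 else math.copysign(math.inf, mean_diff)
    return mean_diff / pooled_sd


def imbalanced_covariates(target, comparator, threshold=0.1):
    """Return {covariate: SMD} for covariates whose |SMD| exceeds the threshold.

    `target` and `comparator` map covariate name -> list of baseline values.
    The default threshold of 0.1 is an assumed rule of thumb.
    """
    flagged = {}
    for name in target:
        smd = standardized_mean_difference(target[name], comparator[name])
        if abs(smd) > threshold:
            flagged[name] = smd
    return flagged


# Toy baseline data: age is balanced, index year is not.
target = {"age": [50, 60, 70], "index_year": [2010, 2011, 2012]}
comparator = {"age": [50, 60, 70], "index_year": [2015, 2016, 2017]}
print(imbalanced_covariates(target, comparator))
```

The same check can be rerun on the matched or stratified cohorts after propensity score adjustment, still without looking at the outcome.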
It is reasonable and appropriate to STOP your study at this point if any of the diagnostics above suggest that your estimated effect will be uninformative (either because it’s not a proper counterfactual proxy, or because there’s residual systematic error due to confounding or bias from other sources, or because you do not have adequate sample to provide sufficient precision for the question of interest).
To return to your scenario: you have an established intervention for a given indication and a new intervention for the same indication which is generally only used when the established intervention has been proven ineffective. So, in this case, it’s quite possible that you will not have enough sample for study design #1 above, because there may not be many observed treatment-naive persons who chose the new intervention. And, even if study design #2 addressed the causal contrast that interests you, you may have a sample issue with finding persons who chose not to have the new intervention if that’s recommended care or the choice of ‘do nothing’ is considered inappropriate. In either case, you may find yourself in a situation where your diagnostics tell you to ‘stop’ because you can’t produce a precise and valid estimate for the desired causal contrast.
But you may be like a lot of researchers who do not want to ‘stop’, even if the data tells you to. Or you might think that relaxing some assumptions might not be ‘too bad’ and that producing a ‘slightly less valid’ estimate may be better than the current reality of nothing. This is tricky business, and I think the OHDSI community could contribute a lot to the field if we could develop additional empirical diagnostics to help with this type of compromised decision-making. While I don’t have any direct solutions for you, here are some consequences of relaxing the time period comparability that you should consider:
If you do not require the target cohort and comparator cohort to belong to the same time interval when both alternative treatments were available (e.g. you use data from when the established intervention was used prior to the new intervention being introduced to practice): you are compromising the exchangeability assumption by having persons whose baseline characteristic of ‘index year/index month’ will not be comparable between the two cohorts. In most circumstances, index year/month matters a lot, because clinical practice (and data capture patterns) evolve substantially over time, and if you did observe a difference, you wouldn’t know if it’s because of the intervention or the surrounding care. Also, you are violating the counterfactual framework, because persons who ‘chose’ the established intervention didn’t actually have the alternative choice of the new intervention…you’d be assuming that those persons would have still made the choice of established intervention had they had the new intervention available as an option, and that’s an untestable assumption that could be potentially problematic. One additional diagnostic you could perform would be to compare the baseline characteristics of the persons in your target cohort prior to new intervention introduction with the baseline characteristics of the persons in your target cohort after new intervention introduction, to attempt to see if there’s been any ‘patient drift’ over time. But observing no difference would not be a definitive marker that the comparison is still valid (though it’s no worse than the set of assumptions made when single-arm RCTs are conducted and a ‘historical control arm’ is created to serve as some proxy comparison).
A much more problematic approach, which I would generally advocate against, would be to compare treatment-naive persons exposed to the established intervention with persons exposed to the new intervention who previously failed the established intervention. Comparing across different ‘lines of therapy’ is particularly difficult, because you are really observing different decision contexts: ‘what to choose to do first?’ is fundamentally different from ‘what to choose to do second after the first option didn’t work?’. Considered another way, you couldn’t design an RCT for this question, because there’s no single moment to assign persons to the alternative treatment arms. These types of comparisons also run the risk of introducing immortal time bias (because the persons in the latter line of therapy had to ‘survive’ through the prior lines) and, in this particular case, the risk of exposure misclassification, since the comparator group would have also been exposed to the established intervention. Again, it would be useful for our OHDSI community to develop empirical diagnostics to accompany the theoretical concerns should you or others decide to move forward in this direction and wish to assess the reliability of the estimates generated.