
Selection of RCTs

(Martijn Schuemie) #1

We’ll need to select a set of RCTs that we can replicate in observational data. Some criteria we may want to consider:

  • Large trials only, so the trial result is reasonably precise, which is nice in a gold standard :wink:
  • Recent trials, so we can restrict our observational data to time prior to when the results of the trial were known
  • (Related to the previous criterion:) Only trials of drugs that were already on the market at the time of the trial, so we actually have observational data prior to the trial
  • Trials with inclusion/exclusion criteria we can also apply to our replication.
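
The two date-related criteria can be read as defining a usable observation window per trial. A minimal sketch of that reading, assuming we know three dates for each trial (the argument names and the example dates are hypothetical, not taken from any trial):

```python
from datetime import date

def usable_window(market_date, trial_start, results_date):
    """Return (start, end) of the observational data usable for replication,
    or None if the drug was not on the market before the trial began."""
    if market_date >= trial_start:
        return None  # no observational data prior to the trial
    # Use data from market entry up to (not including) the date results were known.
    return (market_date, results_date)

# Hypothetical dates for illustration only.
window = usable_window(date(2010, 1, 1), date(2012, 1, 1), date(2015, 6, 1))
too_new = usable_window(date(2013, 1, 1), date(2012, 1, 1), date(2015, 6, 1))
print(window, too_new)
```

Whether the window should end at trial completion or at publication of the results is a design choice this sketch leaves open.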

I will ping Adler to see if he can help. @nigam, @Patrick_Ryan, I know you also have done some work that might help.

How should we proceed?

(Alejandro Schuler) #2

I’ve been mulling this over for some time…

It’s useful to consider the diversity of cohort/treatment/outcome tuples we use. For instance, the best methods to determine the impact of an invasive vs. noninvasive surgery on in-hospital complications would likely be different than the best methods to determine the impact of two second-line drug choices for diabetes on 10-year mortality.

In general, the cohesiveness and disease domain of the cohort, the acuity of the intervention, and the timescale and acuity of the outcome will all define general patterns of observed and unobserved confounding. Over-sampling some of these structures will create bias in the result we get.

The ideal thing to do would be to sample at random from cohort/treatment/outcome tuples, but obviously there aren’t clinical trials for the vast majority of those. Next best might be to sample randomly from existing clinical trials, but then we have to make considerations for trial size, recency, what we can get in the EHR, etc… and now we’re back to what is likely a very biased sample of trials. Because of that, we should be careful about making a one-size-fits-all prescription because we can’t avoid these biases. It might be best to try and get results for subsets, e.g. what are the best methods for short-term pharmaceutical interventions on long-term mortality for chronic diseases?

(Martijn Schuemie) #3

Something I’ve been thinking about that is related to this, but also our selection of negative and positive controls: The evaluation I have in mind for this task force is an evaluation of generic utility of a method. In other words, I believe there are methods (and analysis choices) that are always better than other choices, no matter what the problem, and our evaluation should identify those. On the opposite end of the spectrum (I think you also allude to this in your white paper) is the utility for a specific clinical question. If, for example, I’m looking at Keppra and angioedema, I want to know which method is best to answer that specific question. We may not have enough empirical data points (a gold standard that is very similar to the clinical question at hand) to exhaustively evaluate all methods for this second type of utility, so perhaps we can use a Bayesian approach where our prior beliefs on what method works best are informed by the overall experiment we’re doing here.

I think you’re advocating for an intermediary: utility for a class of clinical questions. My only worry here is that these types of classifications often tend to be subjective rather than objective. For example, what are the criteria for identifying “short-term pharmaceutical interventions”?

(Alejandro Schuler) #4

Yep, you hit the nail on the head! There is no free lunch here. There must be choices that are generally better than others (e.g. some kind of matching should never hurt). Our claims as to what those are will generalize as long as we pay attention to the kinds of RCTs we end up with in the gold standard pool. It’s probably going too far to define particular buckets like “short-term pharmaceutical interventions” and making sure we have enough RCTs in each bucket, but we should at least be able to squint our eyes and from 30,000 ft. be able to claim that we roughly covered enough domains that we feel reasonable in assuming that there is no one class of confounding structures that is not well-represented in the pool of trials.

I carry around a metaphysical homunculus of John Ioannidis on my shoulder… at this moment he’s telling me we should entirely pre-specify the protocol we will use to select trials and include 1. a figure demonstrating how many trials out of the database we’re left with after each filter we apply and 2. a figure showing some aggregate statistics about the final set of trials (e.g. domains, sizes, length of time to outcome, class of interventions- we can use an ontology to determine the categories for certain things). With that, doing our 30,000 ft. judgement of how well we expect things to generalize is a matter of looking at that figure 2 and waving our hands. As long as it’s transparent and in the body of the paper, then it’s open to fair critique and nobody can claim we have a biased selection of trials.

(Alejandro Schuler) #5

In terms of utility for a specific question- yes, that is something I am very interested in and I discuss it in my white paper.

Let’s address it more directly here. Consider each question as its data-generating function Y = H(X, U, W), having treatment effect t(H) = E[H(X, U, 1) - H(X, U, 0)], which generates an observed dataset O(H) = (X, W, Y). We have a set of methods ℳ, where each method M in ℳ takes a dataset and produces an estimate of the effect: t’ = M(O(H)).
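
To make the notation concrete, here is a small simulated sketch (not taken from the white paper): a hypothetical data-generating function H with an unobserved confounder U, the true effect t(H) computed from both potential outcomes, and a naive difference-in-means "method" applied to the observed data O(H). All functional forms, coefficients, and probabilities are made up for illustration.

```python
import random

def H(x, u, w):
    """Hypothetical data-generating function: outcome from covariate x,
    unobserved confounder u, and treatment w. The effect of w is 2.0."""
    return 1.0 * x + 3.0 * u + 2.0 * w

def simulate(n, seed=0):
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        u = rng.gauss(0, 1)
        # Treatment is more likely when u is high, so u confounds w and y.
        w = 1 if rng.random() < (0.8 if u > 0 else 0.2) else 0
        rows.append((x, u, w, H(x, u, w)))
    return rows

def true_effect(rows):
    # t(H) = E[H(X, U, 1) - H(X, U, 0)]; only computable in a simulation.
    return sum(H(x, u, 1) - H(x, u, 0) for x, u, _, _ in rows) / len(rows)

def naive_method(observed):
    # M(O(H)): mean outcome of treated minus mean outcome of untreated.
    treated = [y for _, w, y in observed if w == 1]
    control = [y for _, w, y in observed if w == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

rows = simulate(20000)
observed = [(x, w, y) for x, u, w, y in rows]  # O(H) = (X, W, Y); U is dropped
t = true_effect(rows)
t_hat = naive_method(observed)
print(f"t(H) = {t:.2f}, naive estimate = {t_hat:.2f}, error = {t_hat - t:.2f}")
```

Because U drives both treatment and outcome but is absent from O(H), the naive estimate overshoots the true effect; that gap is the quantity M(O(H)) - t(H) discussed here.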

We want to find the argmin of M(O(H*)) - t(H*) over M in ℳ for a specific question H*. Note that this is totally analogous to the general learning problem: find the argmin of F(x*) - Y(x*) over F in ℱ for a given point x*.

The problem is that we generally don’t have t(H*). If we did, we wouldn’t care what the best inference method for that question is because we already have the treatment effect. Again that’s analogous to the learning problem: we don’t have the value of the function Y at the point x* or else we wouldn’t care about learning it.

We’ll use the same trick that’s used in machine learning, which is to approximate F(x*) - Y(x*) as E[F(x) - Y(x)] where the expectation is taken over a set of points x in X for which we have measurements of Y(x). We can think of these points as forming a neighborhood around x* and we will use them as surrogates and average over the variation.

The fundamental question, and the essential difference between all learning algorithms, is how exactly you decide what is a “neighbor” and how you weigh the surrogate points to average over the variation. Perhaps the simplest algorithm is K-nearest-neighbors, which uses uniform weighting over a set of neighbors defined by a distance: D(x,x*).

The translation to our methods evaluation setting is to find the M that minimizes E[M(O(H)) - t(H)], where the H are drawn from a set ℋ of questions that are “neighbors” of H* and for which t(H) is known. To define that set we need a distance metric G(H,H*). We never know the true data-generating functions, but we can do well by using the observed data as a proxy to define a distance metric J(O,O*). What this tells us is that the best datasets to use to evaluate methods for a question at hand are those that are most similar to the dataset at hand. It’s intuitive almost to the point of tautology. The algorithm I lay out in the white paper does exactly this. It finds the argmin of J(O(H),O*) subject to t(H) = t over a set of questions H in ℋ: it finds the questions that are most statistically similar to the question at hand in terms of the generated datasets, but for which the treatment effect is known. That’s precisely the neighborhood we are looking for.
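
As a rough sketch of the neighborhood idea (not the white-paper algorithm itself), the following picks the k reference questions whose dataset summaries are closest to the target under a placeholder distance J, then chooses the method with the smallest mean absolute error on those neighbors. The summary vectors, the Euclidean distance, the method names, and all numbers are hypothetical; absolute error is used here so that over- and underestimates do not cancel.

```python
import math

def J(o, o_star):
    """Placeholder dataset distance: Euclidean distance between summary vectors."""
    return math.dist(o, o_star)

def best_method_for(o_star, references, k=2):
    """references: list of dicts holding a dataset 'summary', the known
    effect 't', and each candidate method's 'estimates' on that dataset."""
    neighbors = sorted(references, key=lambda r: J(r["summary"], o_star))[:k]
    methods = neighbors[0]["estimates"].keys()
    mean_abs_error = {
        m: sum(abs(r["estimates"][m] - r["t"]) for r in neighbors) / len(neighbors)
        for m in methods
    }
    return min(mean_abs_error, key=mean_abs_error.get), mean_abs_error

# Made-up reference questions with known effects.
references = [
    {"summary": (0.1, 0.9), "t": 1.0, "estimates": {"matching": 1.1, "regression": 1.6}},
    {"summary": (0.2, 0.8), "t": 0.5, "estimates": {"matching": 0.7, "regression": 0.9}},
    {"summary": (0.9, 0.1), "t": 2.0, "estimates": {"matching": 3.0, "regression": 2.1}},
]

choice, errors = best_method_for((0.15, 0.85), references, k=2)
print(choice, errors)
```

In this toy data the answer changes if all three references are used (k=3) instead of the two nearest, which is the point of restricting the average to a neighborhood.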

There is one hitch: The distance G(H,H*) is not perfectly captured in J(O,O*). One part of that is the question of unobserved confounding- if the dataset O contains no information about the unobserved confounding U, then how can we tell if the relationship between U and Y that exists in H* is preserved as much as possible in the generated neighbors H? The answer is that fundamentally we cannot. It is simply not possible because we can never observe that quantity. However, if Y* = H*(X*,U*,W*) is close to Y = H(X*,W*), then (Y*, X*,U*,W*) should not be very different from (Y, X*,U*,W*). That is to say, as long as we remain close to the observed data, the relationships with the unobserved variables will be relatively well preserved. We just can’t quantify by how much.

Is there an alternative? Because of the relative paucity of clinical trials and their quality, we don’t have many, if any, datasets O that are near to O*. There are innumerable differences between any two observational datasets and I would argue that using any real dataset O’ to approximate O* would be farther away than using a semi-simulated dataset O because O is generated in a way that specifically minimizes that distance. In other words, it will always be the case that J(O*, O’) > J(O*, O) except under extremely pathological conditions. In addition, the unobserved confounding structure present in H’ is not more likely to mimic that of H: it cannot be shown that G(H*, H’) - J(O*, O’) < G(H*, H) - J(O*, O). In fact, because of the argument about small perturbations not disturbing the unmeasured confounding, it is more likely that G(H*, H’) - J(O*, O’) > G(H*, H) - J(O*, O). The conclusion is that it is difficult to conceive of a case where G(H*,H’) < G(H*,H) for any real clinical questions H* and H’ and a semi-simulated model H based on H*. It is therefore very difficult to make an argument for using only real data to evaluate the utility of a method for a specific question.

There are many possible ways to include the results for real questions and weight them appropriately relative to the results on semi-simulated datasets (that’s how I conceive of the Bayesian approach Martijn describes). Or, despite my theoretical arguments against it, one might use only real data. How can we empirically tell what’s the best way to do it?

Again there is a perfect analogy to the general learning problem. The different strategies and choices are analogous to different learning algorithms that make different assumptions. For instance- the analysis we are proposing to do for a large set of gold-standard RCTs is analogous to finding the mean of the response variable Y and saying F(x) is a constant function that is equal to the mean of Y (one-size-fits-all). Using subsets of those trials or doing a Bayesian analysis gets closer to adaptive learning methods. Making semi-simulated datasets with my algorithm or doing signal injection as before are analogous to data-augmentation techniques that are used in vision and speech processing.

Just as in the general learning problem, we can hold out a test set or use cross-validation to find the best question-specific method evaluation approach. For instance, to compare the general best method to using my algorithm to find the question-specific best method, one would do the following:

  1. Split the gold standard RCTs 70/30
  2. Find the method that best predicts the treatment effect from the corresponding observational data on the 70% training sample of RCTs
  3. Find the MSE (or AUC) of using that method to predict treatment effects on the observational data corresponding to the 30% sample of RCTs
  4. For each trial in the 30% sample, run my algorithm on the corresponding observational data to find the best method
  5. Then run that dataset-specific method on the observational data to estimate each treatment effect and calculate the error from the real treatment effect
  6. Average those errors to get the MSE (or AUC) of my algorithm
  7. Compare the errors from the one-size-fits-all method to the errors from my algorithm
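
The seven steps above could be outlined as follows. This is only a sketch under stated assumptions: each gold-standard entry is assumed to carry the RCT effect plus precomputed estimates from every candidate method, and `pick_method_for_question` is a hypothetical stand-in for the question-specific algorithm. The toy data is built so that the stand-in always picks the better method, so the output shows the mechanics only and says nothing about real performance.

```python
import random

def mse(errors):
    return sum(e * e for e in errors) / len(errors)

def best_global_method(trials, methods):
    # Step 2: method with the lowest MSE against the RCT effects in training.
    return min(
        methods,
        key=lambda m: mse([t["estimates"][m] - t["rct_effect"] for t in trials]),
    )

def pick_method_for_question(trial, methods):
    # Hypothetical placeholder for the question-specific algorithm (step 4).
    return trial.get("suggested_method", methods[0])

def compare(trials, methods, train_fraction=0.7, seed=0):
    shuffled = trials[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    train, test = shuffled[:cut], shuffled[cut:]              # step 1
    global_m = best_global_method(train, methods)             # step 2
    global_errors = [t["estimates"][global_m] - t["rct_effect"] for t in test]  # step 3
    specific_errors = []
    for t in test:                                            # steps 4 and 5
        m = pick_method_for_question(t, methods)
        specific_errors.append(t["estimates"][m] - t["rct_effect"])
    return {                                                  # steps 6 and 7
        "global_method": global_m,
        "global_mse": mse(global_errors),
        "question_specific_mse": mse(specific_errors),
    }

# Made-up gold standard: two kinds of trials, each favoring a different method.
methods = ["matching", "regression"]
trials = []
for i in range(10):
    if i % 2 == 0:
        estimates, better = {"matching": 1.1, "regression": 1.5}, "matching"
    else:
        estimates, better = {"matching": 1.4, "regression": 1.05}, "regression"
    trials.append({"rct_effect": 1.0, "estimates": estimates, "suggested_method": better})

result = compare(trials, methods)
print(result)
```

With real data, the estimates would come from running each method on the observational data for each trial, and cross-validation could replace the single 70/30 split.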

That will essentially tell us if we get anything out of doing question-specific evaluations and if my algorithm is useful in practice. All of this is precisely what I intend to do in parallel with the big general evaluation that we are working on as OHDSI.

(Nigam Shah) #6

We should download clinicaltrials.gov data from
https://clinicaltrials.gov/ct2/search/advanced
selecting completed studies with results that are phase 3 or phase 4 and
from the last 5 years.

The site then gives a csv download, with fields for study size etc. It’s on
my to-do list to sit down with Alejandro to walk him through this once
together. We then sort by study size, and can review the top 50 or so on a
task force call.
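
A minimal sketch of that filter-and-sort step, assuming the export has columns named `NCT Number`, `Phases`, `Enrollment`, and `Study Results`. Those column names, the value `Has Results`, and the sample rows (including the NCT numbers) are assumptions for illustration; check them against the actual download.

```python
import csv
import io

# Stand-in for the downloaded file; column names and rows are made up.
SAMPLE_CSV = """NCT Number,Phases,Enrollment,Study Results
NCT00000001,Phase 3,12000,Has Results
NCT00000002,Phase 2,30000,Has Results
NCT00000003,Phase 4,800,Has Results
NCT00000004,Phase 3,5000,No Results Available
NCT00000005,Phase 4,,Has Results
"""

def largest_trials(csv_file, top_n=50):
    """Keep phase 3/4 trials with results, sorted by enrollment (largest first)."""
    rows = []
    for row in csv.DictReader(csv_file):
        if row["Phases"] not in ("Phase 3", "Phase 4"):
            continue
        if row["Study Results"] != "Has Results":
            continue
        if not row["Enrollment"].strip().isdigit():
            continue  # skip trials with missing or malformed enrollment
        rows.append(row)
    rows.sort(key=lambda r: int(r["Enrollment"]), reverse=True)
    return rows[:top_n]

top = largest_trials(io.StringIO(SAMPLE_CSV))
for row in top:
    print(row["NCT Number"], row["Enrollment"])
```

For a real export, pass an open file instead of the in-memory string, e.g. `open(path, newline="")` with whatever path the download was saved to.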

We should also poll people like Jeff Drazen at NEJM for suggestions on
trials to use as a gold standard. I will be at their meeting April 3-4 and
can ask informally or formally.

(Vojtech Huser) #7

The CSV export is good, but I suggest using this version of the CT.gov database: http://aact.ctti-clinicaltrials.org/connect

For example (only drug trials with results, up to some cut-off date, plus other criteria):

-- single-arm interventional studies with exactly one intervention, of type Drug,
-- first received before the cut-off date
select *
from interventions
where nct_id in (
    select distinct nct_id from studies
    where study_type = 'Interventional'
      and number_of_arms = 1
      and nct_id in (select nct_id from interventions group by nct_id having count(*) = 1)
      and nct_id in (select nct_id from interventions where intervention_type = 'Drug'))
  and nct_id in (select nct_id from studies where first_received_date < '2016-02-16');

(Martijn Schuemie) #8

Volunteers needed!

I would like to start a small pilot. Would someone be interested in doing the following task?

Pick two RCTs, one that is placebo controlled (to evaluate methods on the question “What is the effect of drug A on outcome X?”), and one that uses an active comparator (for the question “What is the effect of drug A on outcome X compared to drug B?”).

These two RCTs further need to meet the requirements discussed above:

  • Large trials only, so the result of the trial has some accuracy.
  • Recent trials, so we can restrict our observational data to time prior to when the results of the trial were known
  • Only trials of drugs that were already on the market at the time of the trial, so we actually have observational data prior to the trial
  • Trials with inclusion/exclusion criteria we can also apply to our replication.

Once selected, the task further involves implementing the inclusion and exclusion criteria in ATLAS.

(Adding some folks: @saradempster, @hripcsa, @Vojtech_Huser, @nigam, @nicolepratt, @aschuler, @Patrick_Ryan )

(Vojtech Huser) #9

Let me take the volunteer role. :slight_smile:

I took some steps toward picking a trial (using the CT.gov database).
I used >2016-01-01 for recent trials.
I initially focused on finding the placebo-controlled trial.
I used manual criteria for ‘drug on market’.

The hard part is reviewing the enrollment criteria for each trial. For example, trial https://clinicaltrials.gov/ct2/show/study/NCT02219932 fits the other criteria but has the enrollment criterion “Must have an Expanded Disability Status Scale (EDSS) score of 4 to 7, inclusive”.

I emailed some interim results to Martijn and will work with him around some of my other questions.


(Soledad Cepeda) #10

I did a search on clinicaltrials.gov using Sherlock, a JNJ tool that downloads the clinicaltrials.gov database.
It looked for randomized, phase 4 studies with 2 arms that include a drug or a biologic, do not have a placebo arm, and have results.
100 studies included 500 or more subjects. See the attached file:
soledad cepedaRCTMS(2).xlsx (97.3 KB)

(Martijn Schuemie) #11

Thanks Soledad!

I’ve taken your list, and have added the number of people exposed to the interventions prior to the study completion date in a US insurance claims database. The spreadsheet contains the person count for the least prevalent of the interventions:

RCTMS(2)WithCounts.xlsx (92.4 KB)
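
The ranking logic is simple enough to sketch: for each trial, take the smallest exposed-person count across its interventions, then sort trials by that minimum. The trial labels, drug names, and counts below are placeholders, not values from the spreadsheet.

```python
def rank_by_least_prevalent(counts_by_trial):
    """counts_by_trial: trial id -> {intervention: persons exposed before the
    trial's completion date}. Returns (trial, limiting count) pairs, largest first."""
    limiting = {
        trial: min(counts.values()) for trial, counts in counts_by_trial.items()
    }
    return sorted(limiting.items(), key=lambda item: item[1], reverse=True)

# Placeholder counts for illustration only.
counts_by_trial = {
    "trial_A": {"drug_1": 50000, "drug_2": 120000},
    "trial_B": {"drug_2": 110000, "drug_3": 900},
}
ranking = rank_by_least_prevalent(counts_by_trial)
print(ranking)
```

Sorting on the minimum rather than the total reflects that a comparison is limited by its rarest exposure.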

Maybe we could pick one from the top of the list? The study in Niger (NCT00618449) compares two dosing regimens, so it may not be suitable for the many databases where we do not have accurate dosing information. Some of the other studies are on very specific combination drugs, which maybe should be excluded for similar reasons.

(Seng Chan You) #12

Since I’m a cardiologist, I want to suggest RCTs in this field.

This is an RCT comparing 40mg pravastatin vs. 80mg atorvastatin (moderate vs. intensive dose of statin) among patients hospitalized for an acute coronary syndrome, which was published in NEJM in 2004.

Interestingly, although guidelines recommend intensive statin treatment in ACS patients, intensive statins are underutilized worldwide.

I’m not sure this RCT is suitable, because it is quite old. But the inclusion criteria are quite simple, and I think we can enroll enough patients.

(Vojtech Huser) #13

I saw another possible RCT.

In 2016, the FDA warned that canagliflozin MAY BE causing foot amputations.
See the announcement here

In May 2017, they declared causality.


[ 5-16-2017 ] Based on new data from two large clinical trials, the U.S. Food and Drug Administration (FDA) has concluded that the type 2 diabetes medicine canagliflozin (Invokana, Invokamet, Invokamet XR) causes an increased risk of leg and foot amputations. We are requiring new warnings, including our most prominent Boxed Warning, to be added to the canagliflozin drug labels to describe this risk.

We could look at how physicians behaved prior to the causality announcement.

Per Wikipedia, canagliflozin was approved in 2013. The ATC class is A10BK, and it seems to have been growing in popularity since 2013.


I selected two studies:
NCT00206102 and NCT00975195

(Seng Chan You) #15

Dear @Vojtech_Huser
I’m really interested in the association between SGLT2 inhibitors and foot amputation, too.

(Martijn Schuemie) #16

@SCYou: yes, I think the atorvastatin study is too old. We will have very little data in most databases prior to 2003 (completion date of the study).

@Vojtech_Huser: I would prefer not to use a canagliflozin study as a means to measure method performance because my company makes that drug and I therefore have a conflict of interest.

@scepeda: I like the study involving risperidone you mentioned. The drug has been on the market for a long time prior to completion of the study, and the study is a simple two-arm RCT.

(Seng Chan You) #17

I suggest the trial comparing the cardiovascular safety of febuxostat and allopurinol in patients with gout, published in NEJM yesterday, as a good RCT for the replication.

It is quite a large trial (more than 6,000 patients enrolled).
It is definitely a recent trial.
Febuxostat was approved in 2009.
Actually, I preferred febuxostat to allopurinol. I’m convinced that most physicians don’t know about the possibility of cardiovascular adverse effects of febuxostat. You can see the earlier CONFIRM trial, which compared the efficacy and adverse events of the two drugs, here.

We cannot apply the same inclusion/exclusion criteria to a claims database because they require serum urate and creatinine levels. But we can use hospital data, or we can exclude patients with previous renal failure or kidney injury from the claims data.

This RCT was primarily designed to detect adverse events of the drug, not its efficacy. That’s why it is one of the best RCTs we can replicate.

What do you think about this? @schuemie @Patrick_Ryan

(Patrick Ryan) #18

This is a great idea @SCYou! This seems like a nice example of a study that
we could try to approximate in our data network. While we don’t have serum
urate or creatinine, we can approximate some of these characteristics with
diagnosis codes and make sure that we balance our populations on whatever
observations (and lab values) that we do happen to have at baseline.

I created a draft cohort definition to get you and others in the community
started with: ‘New users of febuxostat with gout and cardiovascular
disease’ (http://www.ohdsi.org/web/atlas/#/cohortdefinition/1734846) and
‘New users of allopurinol with gout and cardiovascular disease’ (
http://www.ohdsi.org/web/atlas/#/cohortdefinition/1734847) . You’ll note,
I created the cohort definition with inclusion criteria so you could assess
whether the requirement of gout or prior CV disease was negatively
impacting inclusion. The impact of each criterion varies widely by database
on my side, but in general I’m surprised how many people fail one but not
both of the inclusion criteria, suggesting gout and CV disease may be in
some way negatively correlated. Certainly as we move forward with this,
we’d want to try to represent additional inclusion criteria from the trial
as well.

Across the data I have access to, it looks like I have a promising number
of candidate patients (larger than the trial population size) for ‘new
users of febuxostat’ in IMS Germany, Truven MDCR, Truven CCAE, and Optum,
but <1000 patients in other databases. I see a greater number of patients
on allopurinol in each source…so we’ve got the start of a T and a C, we
need to define an O (which appears to be a MACE endpoint that I know others
in the community have previously defined).

(Seng Chan You) #19

Great! @Patrick_Ryan
I always admire your passion for OHDSI!
I generated the cohorts in my databases according to your definitions in ATLAS.

The febuxostat cohort is fairly small (febuxostat, n=645; allopurinol, n=4550), although that is actually much more than I expected.
I think I can get some results in this population for death or cardiovascular mortality, adjusting only for conditions and drugs (I have data on cause of death).

I also want to mention that we need to generate more evidence for the general population.
This NEJM study focused on a population with previous cardiovascular events because they have a much higher outcome rate, which reduces the cost of the study.

Of course, the main purpose of this study is to replicate the RCT in order to evaluate our methodology.
But if we can generate evidence for the general population, it will be much more helpful for clinicians, patients, and regulators such as the FDA around the world.

(Seng Chan You) #20

We need to read the protocol of the febuxostat paper to replicate this study as closely as we can.