
First step: Defining the broad research approach

The first step will be to define the broad research approach. This should help us identify which papers will be written and may also help folks obtain grants.

We’ll need to answer three major questions:

  1. Which methods will we evaluate?
  2. What gold standard will we use?
  3. What databases will we run the experiment on?

My two cents:

Even though we might later want to evaluate other, new methods, I would propose we start with the OHDSI Methods Library.

For gold standard I would propose we use negative controls and synthetic positive controls derived from negative controls as shown in my presentation at the last OHDSI Symposium. We would need to decide on the exposures for which we will then pick the negative controls, and I know different people have different ideas about how to synthesize positive controls.

I would at least run the experiment on the internal JnJ databases. These would include several large US insurance claims databases and CPRD.

But this is just a straw man. Let me know your thoughts on choices for methods, gold standards, and databases! (@nigam, I think I remember you had different ideas?).

(Including everyone so they’ll be notified even when not Watching: @aschuler, @brian_s, @David_Madigan, @hripcsa, @jweave17, @nicolepratt, @nigam, @TengLiaw, @Vojtech_Huser, @yuxitian )


I will slowly in parallel also look into use of published RCTs as a gold standard. Martijn and I had discussed using a population of positive RCTs so that you don’t need to know the actual effect size for the group to be useful. I think readers will feel most comfortable with positive RCTs. Has someone already parsed the positive RCTs from the literature or ClinicalTrials.gov?

Thanks for the reminder about watching. (Including everyone so they’ll be notified even when not Watching: @aschuler, @brian_s, @David_Madigan, @hripcsa, @jweave17, @nicolepratt, @nigam, @TengLiaw, @Vojtech_Huser, @yuxitian )

George


@schuemie: Not remembering what I came up with :-(. It will come back to me …

Hi @nigam,

I’m referring to your comments on a “good enough benchmark of RCTs to reproduce” :wink:

I think this is in line with what @hripcsa is saying. I’m not yet sure whether known positive controls (as those identified using RCTs) can be used for evaluation since doctors (should) also know they’re positive and might change their behavior to prevent the outcome from occurring.

We could use data from before the RCT publication to keep the result from influencing practice. George

I think known positive controls from RCTs can be used as long as we use observational data from a period prior to the publication of the RCT's results. For example, if we can answer the question of whether intensive management of BP is better (or not) than regular management of BP using data from before the SPRINT trial results were published, that could work well.

Thanks for the memory nudge. The thing I was proposing was to find trials that were completed in the last two years, sort them by size, and use the ones that have >5000 patients as the good-enough benchmark of RCTs to attempt to reproduce.

That makes sense! I think there are two ways to use RCTs as gold standard.

One way, as @hripcsa suggested, is to use RCTs to identify non-null relationships without attempting to consider the magnitude of the effect. This means we could for example compute AUC (ability to separate negative from positive controls), but not compute coverage of the confidence interval.
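
As a minimal sketch of that first option (with made-up numbers; in practice the estimates would come from whatever method is being evaluated), the AUC is just a ranking statistic over the estimates for the labeled controls:

```python
# Minimal sketch (made-up numbers): AUC measuring how well a method's
# effect-size estimates separate positive controls from negative controls.
import numpy as np
from sklearn.metrics import roc_auc_score

# Log relative-risk estimates produced by the method under evaluation.
log_rr_negative = np.array([0.05, -0.10, 0.20, 0.00, 0.15])  # true RR = 1
log_rr_positive = np.array([0.60, 0.45, 0.90, 0.70, 0.30])   # true RR > 1

labels = np.concatenate([np.zeros(len(log_rr_negative)),
                         np.ones(len(log_rr_positive))])
scores = np.concatenate([log_rr_negative, log_rr_positive])

print(f"AUC: {roc_auc_score(labels, scores):.2f}")
```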

The second way would be to see if we can reproduce the actual effect size observed in the RCTs. If the RCT reported a relative risk of 2, we can evaluate whether a method also finds a relative risk of 2 (or close to it). One challenge here is that the population in our databases will likely be different from the population in the trial, and at the very minimum we’ll need to impose the same inclusion and exclusion criteria as used in the trial.
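
For that second option, a rough sketch of the coverage check (all numbers are hypothetical) might look like this:

```python
# Rough sketch (hypothetical numbers): fraction of observational 95% CIs
# that contain the effect size reported by the corresponding RCT.
rct_rr       = [2.0, 0.75, 1.4]   # RCT point estimates (relative risk)
obs_ci_lower = [1.5, 0.60, 0.9]   # observational CI lower bounds
obs_ci_upper = [2.6, 0.95, 1.3]   # observational CI upper bounds

covered = [lo <= rr <= hi
           for rr, lo, hi in zip(rct_rr, obs_ci_lower, obs_ci_upper)]
print(f"Coverage of RCT estimates: {sum(covered) / len(covered):.0%}")  # 2 of 3
```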

The second way would be much more powerful, hopefully showing that well-designed observational studies can approximate clinical trials. But it is also more difficult to achieve because we’ll probably not be able to make the populations 100% comparable, or because we cannot measure the outcome with 0% error. Which way should we go?

Actually, mine isn’t quite just positive or negative. It did include covering the RCT effect size.

I suggested we maximize the proportion of observational studies whose CI covers the RCT result but excludes 1 (if the RCT CI excludes 1). You can’t just go for covering the RCT result or else you can achieve perfection by multiplying all the observational CI lengths by a million.

Now if you know that your observational power is low, then you should still expect the CI to cover the RCT effect, but you may not expect it to exclude 1. Not sure if we have to correct for that.

And we could use a silly baseline: simply multiplying all the observational CI widths by whatever constant maximizes that proportion.
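
To make that concrete, here is a rough sketch (hypothetical numbers, and assuming for simplicity that every RCT CI excludes 1) of the proposed metric and the width-rescaling baseline:

```python
# Rough sketch (hypothetical numbers): proportion of observational CIs that
# cover the RCT estimate AND exclude 1, plus the "silly baseline" of
# rescaling every CI width by a single constant.
import numpy as np

rct_log_rr = np.array([0.69, 0.41, -0.36])  # log RCT point estimates
obs_log_rr = np.array([0.55, 0.30, -0.20])  # log observational estimates
obs_se     = np.array([0.20, 0.35, 0.15])   # observational standard errors

def metric(width_factor=1.0):
    lo = obs_log_rr - 1.96 * obs_se * width_factor
    hi = obs_log_rr + 1.96 * obs_se * width_factor
    covers_rct = (lo <= rct_log_rr) & (rct_log_rr <= hi)
    excludes_1 = (lo > 0) | (hi < 0)  # RR = 1 is 0 on the log scale
    return np.mean(covers_rct & excludes_1)

print("Metric at nominal width:", metric())
# Baseline: the single width multiplier that maximizes the metric.
print("Best rescaled metric:   ", max(metric(f) for f in np.linspace(0.1, 10, 100)))
```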

George

Thinking further, our previous evaluations of calibration just checked that we achieved an actual 95% rate. But we are not merely enlarging the CIs. We are also shifting them. If we were just enlarging them, then we could argue that it is a simple tradeoff between PPV and NPV, and we are shifting the threshold to achieve 95% without losing the study’s predictive power. But because we are also shifting the mean, we could be losing predictive power. To take an extreme example, imagine shifting all the estimates to 1 and adjusting all the CI widths so that they cover 95% of true values that researchers tend to study. Then we could have perfect calibration but zero predictive power.

Therefore, I think we need to retain a measure of predictive value to ensure that we are not losing out. I agree that we are trying to tell whether the observational study results are “close” to the RCT results, but how do we define close? That’s why covering the RCT result while excluding 1 came up.

But it may be necessary to use two separate metrics. One metric can judge whether we cover 95% of RCT results (or whatever percentage is appropriate given that the RCT results have their own variance). And the other metric can judge how often the observational study came to the same conclusion as the RCT.

When I say “cover the RCT result” I don’t necessarily mean cover the nominal result, but I guess test whether the observational and RCT results are distinguishable.
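
One way to operationalize “distinguishable” (just a sketch, with hypothetical numbers) would be a z-test on the difference of the log estimates, using both standard errors:

```python
# Sketch (hypothetical numbers): are an observational estimate and an RCT
# estimate statistically distinguishable? Two-sample z-test on the log scale.
import math
from scipy.stats import norm

obs_log_rr, obs_se = 0.55, 0.20  # observational estimate and its SE
rct_log_rr, rct_se = 0.69, 0.10  # RCT estimate and its SE

z = (obs_log_rr - rct_log_rr) / math.sqrt(obs_se**2 + rct_se**2)
p = 2 * (1 - norm.cdf(abs(z)))
print(f"z = {z:.2f}, p = {p:.2f}")  # a large p-value: not distinguishable
```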

George

Hi @hripcsa,

Good point, I think we should just always report both coverage and predictive accuracy. Requiring that you always exclude 1 might be tricky because we may just not have enough power, but there’s nothing stopping us from including negative controls next to our RCT-derived positives to estimate AUC.

Cheers,
Martijn

Summarizing our discussion at the OHDSI face-to-face:

Martijn gave a short introduction of the workgroup, and started the discussion on method evaluation. He pointed out that real positive controls are problematic because they are known, and doctors will try to mitigate the effect of the drug.

  • It was remarked that the current evaluation appears to focus only on detecting effects present during exposure. The notion of effects that require accumulation of exposure seems missing. Even though those effects would still fall under effects during exposure, we are indeed not focusing on such effects. They are not out of scope for OHDSI, but are probably out of scope for this evaluation since we have to focus on something, and effects during exposure are an important topic.

  • Whilst our negative controls can have unmeasured confounding, our synthetic positive controls are not able to preserve unmeasured confounding. One way to address this shortcoming of our methodology is to mimic missing confounding by removing data available for adjustment.

  • Evaluating against Randomized Controlled Trials (RCTs) is important for political reasons, but is problematic because:

    • RCTs themselves are likely biased due to non-random acts after the moment of randomization
    • RCTs typically have limited sample size
    • Even though we would like to have RCTs with observational data preceding the moment the effect was known, the effect was probably already known long before the trial
  • Adler Perotte may help in identifying RCTs to include in our evaluation. He has been working on codifying the inclusion and exclusion criteria for trials so they can be implemented in the CDM.

  • Alejandro Schuler has developed an advanced approach for simulating effects. It may be possible to use this to inject signals on top of negative controls (a deliberately simplified sketch of the general idea follows this list).

  • Many more people have indicated they want to be involved in the task force. Martijn recommended they post their intentions on the forums, so they can be added.
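
For those unfamiliar with signal injection, here is a deliberately simplified sketch of the general idea (hypothetical data; it ignores the covariate-based outcome modeling that the approaches mentioned above would use):

```python
# Simplified sketch (hypothetical data) of signal injection: add simulated
# outcomes during exposed time so that a negative control (true RR = 1)
# becomes a synthetic positive control with a chosen target relative risk.
import numpy as np

rng = np.random.default_rng(42)
target_rr = 2.0

# Hypothetical per-subject data for a negative control outcome.
exposed_days = rng.integers(30, 365, size=1000)   # days of exposure
baseline_rate = 0.0005                            # outcomes per exposed day
observed_outcomes = rng.poisson(baseline_rate * exposed_days)

# Inject extra outcomes so the total rate is target_rr times the baseline rate.
extra_rate = baseline_rate * (target_rr - 1.0)
injected_outcomes = rng.poisson(extra_rate * exposed_days)

total_outcomes = observed_outcomes + injected_outcomes
print("Outcomes before injection:", observed_outcomes.sum())
print("Outcomes after injection: ", total_outcomes.sum())  # ~target_rr times as many
```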

Let me try and wrap up this discussion.

The broad research approach we’ll take is to evaluate methods using:

  • Negative controls (n > 100)
  • Synthetic positive controls derived from these negative controls (n > 100)
  • Some set of RCTs (1 < n < 100)

The evaluation will focus on estimation of relative risk during exposure (as opposed to estimation of risk due to cumulative exposure).

Methods will be evaluated on a set of observational databases (to be determined, but will include the databases at JnJ).

I propose we write at least two papers:

  1. Description of the Standardized Method Evaluation Framework (Benchmark?), demonstrated on one or two vanilla methods.
  2. Application of the Standardized Method Evaluation Framework to a large set of methods currently being used in observational research (including the new-user cohort method using propensity scores, self-controlled case series, case-control, and case-crossover), including a large set of possible analysis choices within each method.

Let me know if you agree (or not)!

Sounds great. Are we tracking the RCTs? (E.g., Adler, Nigam?) George

Sorry for being dense. What exactly do you mean by ‘tracking the RCTs’?

Sorry, wrong word. I meant creating a collection of RCTs that meet a reasonable set of criteria (size, can be implemented in OHDSI, etc.), and collecting the effect sizes and variances.

George

If we all agree on the overall approach, I think the next step is to identify the tasks that need to be completed. I came up with this list:

  • Identify exposures of interest and negative controls
  • Refine approach to positive control synthesis
  • Evaluate effect of unmeasured confounding in positive control synthesis
  • Identify RCTs and implement inclusion criteria*
  • Implement case-crossover / case-time-control
  • Define universe of methods to evaluate
  • Identify list of databases to run on
  • Develop evaluation metrics
  • Implement and execute evaluation
  • Write papers

One of these tasks, marked with an asterisk (*), is the one George mentioned: creating a collection of RCTs, for which I will create a separate topic.

I would like to ask for volunteers for these tasks! Please start topics for any one of these if you like. You all joined the task force, so I expect you are eager to roll up your sleeves and get to work!


I’ll be synthesizing both positive and negative controls.

I will also be able to contribute to the identification of methods to evaluate on, but mainly within the realm of new-user cohort methods.

Also, a quick question: what do you mean by evaluating the effect of unmeasured confounding in data synthesis?

By “evaluating the effect of unmeasured confounding in data synthesis” I meant that no matter how we decide to generate positive controls, we will always need to make some sort of assumption on the nature and magnitude of unmeasured confounding. I was thinking we could do some empirical evaluation of those assumptions, although I’m not yet sure what that would look like.

Yeah, by the nature of the question it’s impossible to evaluate it directly in real data.

Hi, @schuemie
You know I’m still a novice in this field, so I’m not sure that I can be helpful, and my English is poor. But I just wanted to write down what I thought about this question.

  • Database
    I want to provide data from the Korean national health insurance system (NHIS), which covers more than 98% of Koreans. The NHIS sample cohort database contains 2% of the total population (1M) and also contains results from health examinations. The conversion process is almost done; I’m planning to open the ETL queries and write a paper about the process.

  • Emulating RCTs:
    Since I am basically a clinician (a cardiologist), I’m really eager to emulate RCTs using observational studies.
    To me, the two most striking RCTs of 2016 were ‘HOPE-3’ and ‘SPRINT’. These two RCTs meet the criteria discussed earlier in this thread: they used existing drugs in patients selected by novel criteria, and the study results were positive (which surprised me and others).
    HOPE-3: http://www.nejm.org/doi/full/10.1056/NEJMoa1600176#t=article
    SPRINT: http://www.nejm.org/doi/full/10.1056/NEJMoa1511939#t=article
    Also, many great RCTs targeting HFpEF (heart failure with preserved ejection fraction) had negative results; we could replicate these RCTs as negative controls.
    I really agree with @hripcsa that we need to emulate RCTs, and I’ll try to. But I’m not sure emulating RCTs can be a ‘gold standard’ for verifying the validity of a method, because I don’t think we can replicate the exact inclusion criteria and study protocols. Still, replicating RCTs is very important and meaningful in itself.
    (Actually, I don’t think a replication of the SPRINT or HOPE-3 trial in Korea would come out positive, because cardiovascular risk is much lower in Asians; it would be very hard to prove a beneficial effect of these drugs in an intermediate-risk population.)

  • Positive / Negative controls
    I agree with the idea of negative controls, but we cannot pick truly negative controls, because there can be many unobserved confounding factors, as @schuemie said: the databases we have do not reflect the complete, true medical history of patients. So I’m not sure the negative controls we pick can serve as the ‘gold standard’.
    Take ‘ingrowing nail’ as an example. I do believe that antihypertensive drugs are not associated with ingrowing nails, but access to the health care system, socioeconomic status, or worrying about one’s health can all be related to whether a medical claim or diagnosis code for ingrowing nail is recorded.
    So I agree with @aschuler’s idea of synthesizing both positive and negative controls.

Furthermore, I think the team for ‘positive / negative controls’ and the team for ‘method development’ should be separate. If the ‘method development’ team knows how the positive and negative controls are made, they might develop methods tuned to that specific logic.

I will help as much as I can.
