
How do we define what is a 'good study'?

I just ran into this recent paper comparing the various guidelines for observational studies. Even though the results are not surprising, it made me wonder how OHDSI might frame its recommendations for observational research.

If it were up to me, all observational research would

  • be fully reproducible, meaning that at least the full analysis code should be available as open source (preferably the data itself would also be publicly available)
  • use negative controls and p-value calibration to quantify residual bias (see the sketch after this list)
  • use proven and tested methods, such as our CohortMethod package (preferably using an OMOP-like experiment to find the best method and analysis settings for the problem and data at hand)
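
To make the second bullet concrete, here is a minimal sketch in R of negative-control-based p-value calibration. It assumes the EmpiricalCalibration package; the negative control estimates are simulated purely for illustration (in a real study they would come from running the identical analysis on exposure-outcome pairs believed to have no causal relation).

```r
# Minimal sketch of p-value calibration (assumes the EmpiricalCalibration
# package). The negative control estimates below are simulated for illustration.
library(EmpiricalCalibration)

set.seed(123)
# Simulated log relative risks and standard errors for 50 negative controls,
# with a small positive shift to mimic residual bias:
ncLogRr   <- rnorm(50, mean = 0.1, sd = 0.15)
ncSeLogRr <- runif(50, min = 0.05, max = 0.30)

# Fit the empirical null distribution to the negative control estimates:
null <- fitNull(logRr = ncLogRr, seLogRr = ncSeLogRr)

# Calibrate the p-value for a (hypothetical) outcome of interest:
calibrateP(null, logRr = log(1.25), seLogRr = 0.10)

# For comparison, the traditional (uncalibrated) two-sided p-value:
2 * pnorm(-abs(log(1.25) / 0.10))
```

Because the empirical null absorbs the systematic error observed on the negative controls, the calibrated p-value will generally be less extreme than the traditional one whenever residual bias is present.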

But who am I to dictate this? Is there a process by which we can come to the OHDSI recommendations for good studies? And how would we communicate these recommendations? By creating yet another guideline?

I think it would be good to post OHDSI guidelines on our website.

I would add - have a public, pre-specified analysis plan

This is a great thread @schuemie, thanks for starting it. I agree with
everything @schuemie and @David_Madigan said. I’d probably add a few other
recommendations:

  • Produce and publicly share all appropriate analysis diagnostics (e.g. in
    CohortMethod, outputs like the covariate balance summary, propensity score
    distributions, etc.). This raises the question of what additional
    diagnostics the OHDSI community can develop to improve the reliability of
    observational studies, which seems a worthy pursuit in its own right. (A
    sketch of one such diagnostic appears after this list.)

  • Clearly pre-define all hypotheses to be tested. The statistical plan
    should clearly communicate how multiple testing is being accommodated if
    statements of statistical significance are intended to be made.

  • Produce and publicly share all marginal statistics generated by the
    standard analysis packages. All too often I see publications that only
    share the hazard ratio and 95% confidence intervals, without the full
    summary of the population, number of exposed, number of events,
    time-to-event distribution, etc. At a minimum, all such material should be
    provided in a publication appendix.
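
As a concrete example of the kind of diagnostic meant in the first bullet above, here is a minimal sketch computing per-covariate standardized mean differences between target and comparator cohorts in plain R, on made-up data; CohortMethod produces an equivalent covariate balance summary as part of its standard output.

```r
# Minimal sketch of a covariate balance diagnostic: standardized mean
# differences (SMD) per covariate between target and comparator cohorts.
# The data are made up; CohortMethod generates this summary for real studies.
set.seed(42)
n <- 1000
covariates <- data.frame(
  age         = rnorm(n, mean = 65, sd = 10),
  diabetes    = rbinom(n, 1, 0.3),
  priorStatin = rbinom(n, 1, 0.5)
)
treatment <- rbinom(n, 1, 0.5)  # 1 = target cohort, 0 = comparator cohort

smd <- sapply(covariates, function(x) {
  m1 <- mean(x[treatment == 1]); m0 <- mean(x[treatment == 0])
  v1 <- var(x[treatment == 1]);  v0 <- var(x[treatment == 0])
  (m1 - m0) / sqrt((v1 + v0) / 2)
})

# A common rule of thumb flags |SMD| > 0.1 as a potentially meaningful imbalance:
smd[abs(smd) > 0.1]
```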

In terms of how we disseminate these recommendations, I agree with David
that we can start by posting it on the OHDSI website and making sure that
all OHDSI network studies conform to these minimal standards. If we
determine that they are truly useful in guiding research and development,
then perhaps they will warrant their own publication, but I’d think having
several studies to point to that used the guidelines would be the necessary
first step.

Friends:

Actually, that would be wonderful and extremely apropos vis-à-vis the IMEDS project I am exposed to, which tests the OMOP 2010 experiment against pharmacoepidemiological expertise.

However, as somebody who looks at this somewhat from the sidelines (I am really not running studies): the problem with the usual FDA-style “You need to disclose the following”, followed by a checklist of vaguely defined pieces, is that people like to fill it with a lot of fluff. In my opinion, the whole problem is this:

  1. Declare what you want to measure
  2. Declare how you want to avoid bias. The latter can be done:
  • By clever data selection
  • By some smart algorithm trying to measure and undo, or avoid, the bias
  • By measuring some of the bias and making it transparent
  • By guessing what might happen on the basis of some clinical expertise

So, something like a “statistical analysis plan” is probably not sufficient.

Anything we can do to make this more explicit, instead of just adding another checklist?

I agree it is important to be very precise beforehand about what it is that you are going to do. We might even require that the full study code must be ready and posted before you can launch the actual study. (You can use the negative controls to test whether everything is working).

But in addition to that, I think OHDSI studies should always include empirical evidence that the most important assumptions are met. This evidence can take the form of equipoise in the propensity score distribution, covariate balance showing that comparability has been achieved, a negative control distribution showing that bias has been dealt with, etc.
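
For the equipoise check, here is a minimal sketch in plain R, assuming propensity scores have already been estimated (they are simulated below). It applies the preference score transformation of Walker et al. and reports the share of subjects in the commonly used 0.3 to 0.7 equipoise range.

```r
# Minimal sketch of an equipoise diagnostic based on preference scores.
# Propensity scores are simulated here; in a real study they would come from
# the fitted propensity model.
set.seed(7)
ps <- plogis(rnorm(2000, mean = 0.5, sd = 1))  # simulated propensity scores
treatment <- rbinom(2000, 1, ps)

# Preference score: propensity score adjusted for the overall share of
# subjects receiving the target treatment (Walker et al.):
proportionTreated <- mean(treatment)
preferenceScore <- plogis(qlogis(ps) - qlogis(proportionTreated))

# Share of the study population in equipoise (preference score in [0.3, 0.7]):
mean(preferenceScore >= 0.3 & preferenceScore <= 0.7)

# Visual check: overlap of the preference score distributions by group
hist(preferenceScore[treatment == 1], col = rgb(1, 0, 0, 0.4), breaks = 30,
     main = "Preference score by group", xlab = "Preference score")
hist(preferenceScore[treatment == 0], col = rgb(0, 0, 1, 0.4), breaks = 30,
     add = TRUE)
```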

We can take this even further by requiring that even small assumptions, like that a piece of code is doing what it is supposed to do, should be covered by some evaluation (i.e. a unit test).
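
To illustrate what I mean by covering even small assumptions with an evaluation, here is a minimal sketch using the testthat package; the daysAtRisk function is a made-up example of a small analytic building block, not something taken from one of our packages.

```r
# Minimal sketch of a unit test (assumes the testthat package).
library(testthat)

# Hypothetical helper: days at risk, from exposure start to the earlier of
# exposure end plus a surveillance window or the end of observation.
daysAtRisk <- function(exposureStart, exposureEnd, observationEnd,
                       surveillanceDays = 30) {
  riskEnd <- pmin(exposureEnd + surveillanceDays, observationEnd)
  pmax(as.numeric(riskEnd - exposureStart), 0)
}

test_that("risk window is truncated at end of observation", {
  expect_equal(
    daysAtRisk(as.Date("2020-01-01"), as.Date("2020-01-10"),
               observationEnd = as.Date("2020-01-15")),
    14
  )
})

test_that("risk window is never negative", {
  expect_equal(
    daysAtRisk(as.Date("2020-01-01"), as.Date("2020-01-10"),
               observationEnd = as.Date("2019-12-31")),
    0
  )
})
```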

+1 to @schuemie for bringing up this excellent topic. I agree with the sentiments of “let’s be explicit” but also “let’s not invent a new checklist.”

Re: the latter, it is worth scanning the appendix to the paper, where they list out the specific recommendations from each source. At the very least, this should temper the urge to come up with a bunch of new ways of stating these same principles. Perhaps the OHDSI principles would affirm individual elements from these sources, adding only what is OHDSI-specific. In other words, we are not proposing a new checklist of principles for observational research writ large (i.e., all observational research should…), but rather are just stating our internal benchmarks for OHDSI studies.

And as @Patrick_Ryan noted, once we’ve gotten enough network studies under our belt, we can come out with a published set of recommendations that have been further refined by this work.

The protocol for a well-defined study may differ by purpose: signal detection, refinement, or confirmation (I think this is the current language; if not, replace with the updated nomenclature).

I would certainly be willing to work on an OMOP/OHDSI best-practices document if there were an interested party.

I come off as antagonistic sometimes, so I want to just state that I think the CohortMethod package is excellent. That said, how would we define a proven and tested method? Is the CohortMethod package well-validated or are the individual techniques well accepted within our profession?

I wonder how we would think about the p-value calibration issue if we were doing a single CER study between 2 drugs with perceived equipoise. Are you suggesting that any new CER study should be accompanied by a standard battery of “known” effects to prove that you can use your data and tools to reproduce expected results? This is an interesting idea that deserves more attention.

I’m very happy to see everyone participating in this discussion!

@jon_duke: Yes, I agree we shouldn’t reinvent the wheel, and should instead steal from existing lists where we agree with them. I also agree with @Patrick_Ryan’s point that we should practice what we preach for some time before we make our recommendations ‘official’.

@bcs: I was thinking we could have general recommendations that apply to all types of studies, and specific recommendations for specific types of studies. I dislike the detection-refinement-confirmation classification (but that is another discussion); it seems to make more sense to classify by study design (cohort method, SCCS, etc.).

We already have a demonstration of using negative controls in a CER setting, as you can see in this CohortMethod vignette. And yes, I think all studies should use negative controls.

The reason I mentioned CohortMethod as a ‘proven and tested method’ is that it uses various mechanisms for validation, including unit tests both in the package itself and in the Cyclops package it builds on. (It is far from complete though; writing unit tests is hard work.)

Maybe we should first define some overall principles that we think are important, and turn these into concrete ‘rules’ later on. I would say the overall principles are:

  1. Transparency: others should be able to reproduce your study in every detail
  2. Be explicit upfront about what you want to measure and how: this will avoid hidden multiple testing (fishing expeditions, p-value hacking)
  3. (Empirical) validation of your analysis: you should have evidence that your analysis does what you say it does (showing that statistics that are produced have nominal operating characteristics (e.g. p-value calibration), showing that specific important assumptions are met (e.g. covariate balance), using unit tests to validate pieces of code, etc.)

This discussion will be continued in the Population-level estimation workgroup meeting.

The results of those discussions will be documented here over the course of time.
