
[HADES] Newbie questions migrating from custom R scripts to HADES for epidemiological research on OMOP

Hi all! My name is Nathaniel and I’m part of a team that is new to the OHDSI / HADES community. We’re so excited to learn more from the community and hopefully start contributing ourselves soon.

First off my apologies for the long post before getting to my core questions, but I think the context will be useful. For those willing to read our story and respond, I would be grateful for your feedback!


Our team conducts epidemiological research using real-world data (RWD) and builds tools to help make that process easier.

We are trying to understand how we can leverage HADES for OMOP data in an R based workflow in order to reduce the amount of code users need to write to create and analyze cohorts from OMOP.


Typically, we write analytical methods ‘from scratch’ for each project, using a combination of SQL and R.

Procedurally, our research typically follows a patient-level framework, whereby we iteratively create a “one-row-per-patient” patient-level table, with summarised variables, filtered down to patients who satisfy inclusion criteria.

Those summarized variables can be things like

  • (date) First date of a diagnosis of <condition>
  • (logical) Did the patient ever receive drug <drug> after <date>?
  • (character) What state is the patient from?
  • (numeric) How many inpatient visits did the patient have within <date_window> around <event>?
  • (logical) Did the patient have at least 2 inpatient visits within <date_window> around <event>?

Each of these patient-level variables can require a different method to derive.

Some are simple and can be read directly from an existing patient level table:

  • (character) Patient sex
  • (date) Patient’s date of death

Others are summaries that require some computation but are easy to calculate:

  • (integer) Total number of inpatient visits within a specific <date_window>
  • (date) Date of patient’s earliest treatment of <treatment>

Others are the results of complex algorithms that require multiple parameters:

  • (character) What is a patient’s biomarker status for a biomarker <biomarker> at <date>, where biomarker status is defined as a summary across multiple potential biomarker values within a specific <date_window>
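To make one of these concrete, here is a minimal sketch of how a variable like the inpatient-visit count could be derived with parameterized OHDSI SQL in R. The cohort table, schema names, and window size are placeholder assumptions; 9201 is the standard OMOP concept id for “Inpatient Visit”, and OHDSI SQL is written in SQL Server dialect and then translated to the target database with SqlRender:

```r
# Hedged sketch: "number of inpatient visits within a window around the
# index date", as parameterized OHDSI SQL. @cohort_table, @cdm_schema,
# and @window_days are placeholders filled in by SqlRender::render().
library(SqlRender)

sql <- "
SELECT c.subject_id,
       COUNT(*) AS n_inpatient_visits
FROM @cohort_table c
INNER JOIN @cdm_schema.visit_occurrence v
  ON v.person_id = c.subject_id
WHERE v.visit_concept_id = 9201  -- standard concept: Inpatient Visit
  AND v.visit_start_date >= DATEADD(DAY, -@window_days, c.cohort_start_date)
  AND v.visit_start_date <= DATEADD(DAY,  @window_days, c.cohort_start_date)
GROUP BY c.subject_id;"

rendered   <- render(sql, cohort_table = "scratch.my_cohort",
                     cdm_schema = "cdm", window_days = 30)
translated <- translate(rendered, targetDialect = "postgresql")
```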

**What we (believe we) know**

Based on my reading of the (fantastic) OHDSI documentation, we understand the following:

  • The HADES packages provide a menu of R based functionality for many steps in the cohort creation and analysis process.
  • The most relevant packages appear to be those for constructing and evaluating cohorts, such as Capr, PhenotypeLibrary, and CohortGenerator.
  • Guidelines for designing a simple study can be found in The Book of OHDSI (Chapter 9: SQL and R)
  • Community developed queries that implement common analyses and patient phenotypes are stored in locations such as the QueryLibrary https://github.com/OHDSI/QueryLibrary and the PhenotypeLibrary package


Here are some of my high level questions I’d love feedback on!

  • What are some recommended ‘end-to-end’ workflows for defining and analyzing complex cohorts with highly custom patient-level summarized variables?

    • Is Capr the state of the art? Or are there other approaches?
    • Is there a collection of “gold standard” recommended end-to-end implementations of different kinds of study designs?
  • How often is custom SQL needed to implement patient-level variables?

    • Is this commonly expected, or can it be avoided through HADES packages such as Capr?
  • How should community developed queries be embedded in analysis scripts?

    • Do they need to be copied and pasted, or can they be called programmatically (e.g., through wrapper functions that find, edit, and embed the query)?
    • Is there functionality in HADES (or somewhere else) to access and customize these queries through functions?
  • What is the relationship between ATLAS and HADES?

    • Does ATLAS use HADES under the hood? Or are they completely independent code-bases?

Welcome to the community @Nathaniel_Phillips! I can help with some of your questions.

  • Is capr the state of the art? Or are there other approaches?

Capr is an R package for defining cohorts in R code, and it exactly matches the options available in Atlas. It will not give you patient-level data; it is only used to express the definition of the cohort you’re interested in using in your analysis. I would say the main alternative to Capr is Atlas. Capr is probably better if you have 100 cohorts to make, and Atlas is probably better if you have a small number of possibly complex cohorts.
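For a flavor of what Capr code looks like, here is a minimal sketch using the Capr 2.x-style interface (function names may differ slightly across versions; concept id 201826 is “Type 2 diabetes mellitus”):

```r
library(Capr)

# Concept set: type 2 diabetes mellitus and all of its descendants
t2d <- cs(descendants(201826), name = "Type 2 diabetes mellitus")

# Cohort: first diagnosis, requiring 365 days of prior observation,
# exiting at the end of continuous observation
ch <- cohort(
  entry = entry(
    conditionOccurrence(t2d),
    observationWindow = continuousObservation(priorDays = 365),
    primaryCriteriaLimit = "First"
  ),
  exit = exit(endStrategy = observationExit())
)

# Serialize to the same Circe JSON that Atlas produces
cohortJson <- as.json(ch)
```

The resulting JSON can be handed to CohortGenerator to build the cohort table against your CDM.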

I’ll leave the gold standard study designs question to others but just mention that you might look at Developing patient level prediction using data in the OMOP Common Data Model • PatientLevelPrediction and New-User Cohort Method with Large Scale Propensity and Outcome Models • CohortMethod as a couple examples.


For patient level datasets you can use FeatureExtraction to quickly get large covariate sets.
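As a hedged sketch of that workflow (the connection details, schema, and cohort table names are placeholder assumptions, and argument names can vary slightly between FeatureExtraction versions):

```r
library(DatabaseConnector)
library(FeatureExtraction)

# Placeholder connection details -- replace with your own
connectionDetails <- createConnectionDetails(
  dbms = "postgresql",
  server = "localhost/omop",
  user = "user",
  password = "secret"
)

# Thousands of standard covariates (demographics, conditions, drugs, ...)
covariateSettings <- createDefaultCovariateSettings()

# Extract one row of covariates per cohort member
covariateData <- getDbCovariateData(
  connectionDetails = connectionDetails,
  cdmDatabaseSchema = "cdm",
  cohortDatabaseSchema = "scratch",
  cohortTable = "my_cohort",
  covariateSettings = covariateSettings
)
summary(covariateData)
```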

How should community developed queries be embedded in analysis scripts?

I’d probably suggest embedding community-developed queries in your study codebase unless that SQL is part of an analytic package that has a release and a maintainer. That is, you can “import” a package or function, but a standalone SQL script should be copied into your study.
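Even a copied query can still be parameterized and translated at run time rather than edited by hand. A sketch, where the file path and the `cdm_schema` parameter are assumptions about how the copied query is written:

```r
library(SqlRender)
library(DatabaseConnector)

# The community query, copied into the study repo as a .sql file
sql <- readSql("sql/community_query.sql")

# Fill in @-parameters, then translate to the target SQL dialect
sql <- render(sql, cdm_schema = "cdm")
sql <- translate(sql, targetDialect = "postgresql")

# Placeholder connection details -- replace with your own
connectionDetails <- createConnectionDetails(dbms = "postgresql",
                                             server = "localhost/omop")
connection <- connect(connectionDetails)
result <- querySql(connection, sql)
disconnect(connection)
```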

What is the relationship between ATLAS and HADES?

Atlas and Hades share common baseline components: SqlRender for SQL translation and circe-be for cohort SQL generation. Atlas has some functionality that does not exist in Hades, and Hades has quite a lot of functionality that is not available in Atlas. Basically, there are some lower-level Java libraries that are imported by both Hades and Atlas, the main ones being SqlRender, Circe, and FeatureExtraction. Hades development tends to move faster than Atlas development.

The Hades developers have been working on a new framework called Strategus for modularized studies that kind of wraps various Hades packages so that’s something to look at.

I’ve been working on a package for Darwin called CDMConnector that provides a low code interface to the CDM you can use to get patient level data with another package called PatientProfiles. These packages are not part of Hades because they have overlapping functionality with existing packages.
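A small sketch of that style of workflow (function and argument names follow recent CDMConnector releases and may differ in older versions; the connection and schemas are placeholders, and 9201 is the standard concept id for “Inpatient Visit”):

```r
library(DBI)
library(dplyr)
library(CDMConnector)

# Placeholder connection; CDMConnector also supports duckdb, SQL Server, ...
con <- DBI::dbConnect(RPostgres::Postgres(), dbname = "omop")
cdm <- cdmFromCon(con, cdmSchema = "cdm", writeSchema = "scratch")

# One-row-per-patient summary with ordinary dplyr verbs, executed in-database
inpatient_counts <- cdm$visit_occurrence %>%
  filter(visit_concept_id == 9201) %>%   # Inpatient Visit
  group_by(person_id) %>%
  summarise(n_inpatient_visits = n()) %>%
  collect()
```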

In short there are lots of tools out there (on github). Happy to help you get the lay of the land and figure out what will work best for your use cases.


HADES aims to promote (enforce?) best practices, so in general I recommend following HADES workflows. So here’s how I would answer this question:

How often is custom SQL needed to implement patient-level variables?

Ideally never. You can define populations of interest (cohorts) using Capr or ATLAS cohort definitions. FeatureExtraction has a large collection of standard patient variables (aka ‘features’ or ‘covariates’) it creates by default for your cohorts. If you want something not included in FeatureExtraction’s default set, you can now create features based on cohorts (as described here).
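If you only want a subset of the defaults, FeatureExtraction also lets you toggle individual analyses and time windows. A sketch (see the documentation of `createCovariateSettings()` for the full list of switches; the windows below are illustrative):

```r
library(FeatureExtraction)

# Pick specific analyses instead of the full default covariate set
covariateSettings <- createCovariateSettings(
  useDemographicsGender = TRUE,
  useDemographicsAgeGroup = TRUE,
  useConditionOccurrenceLongTerm = TRUE,  # conditions in the long-term window
  useVisitCountShortTerm = TRUE,          # visit counts in the short-term window
  longTermStartDays = -365,
  shortTermStartDays = -30,
  endDays = 0
)
```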