OHDSI Home | Forums | Wiki | Github

How to get attrition table and chart in HADES?

Hi all! I’m trying to understand how to access a cohort attrition table and chart in R once a cohort has been generated using CohortGenerator::generateCohortSet() but I can’t seem to figure it out.

By “attrition table” I mean the table I’m looking for is one that has one row per inclusion criteria, and shows how many patients remain after applying that criteria. I know that ATLAS provides an attrition chart representing these data, but now I’m trying to understand how to (re)-create it in R using HADES

I tried looking in the CohortGenerator package for an example of this but couldn’t find one. I hoped it would be a result from Used to get a list of cohort table names to use when creating the cohort tables — getCohortTableNames • CohortGenerator but the outputs don’t look quite right to me.

If anyone could point me towards an example of how to access an attrition table that would be great, thank you!

Tagging @Adam_Black and @mdlavallee92 who I think might know!

1 Like

Following up on my own post, it seems like this question has also been raised in https://github.com/OHDSI/CohortGenerator/issues/79#issuecomment-1702660257. Looks like it still hasn’t been resolved.

But now thanks to @Adam_Black we can figure out the binary logic to extract what we need, trying to figure out how to functionalize that now. If we can get it working we’ll happily submit a PR!

Hi @Nathaniel_Phillips,

I’ll try to explain as best I can but I don’t fully understand the logic behind the inclusion rule mask yet. It just hasn’t “clicked” for me yet. But here we go…

Let’s use a cohort from the phenotype library “Febrile seizure”.
https://atlas-demo.ohdsi.org/#/cohortdefinition/1785335/definition

This cohort has two inclusion rules.

I generated this cohort on a test database I use with the following code.

library(DatabaseConnector)
library(dplyr)

connectionDetails <- createConnectionDetails(
  dbms = "sql server",
  server = Sys.getenv("CDM5_SQL_SERVER_SERVER"),
  user = Sys.getenv("CDM5_SQL_SERVER_USER"),
  password = Sys.getenv("CDM5_SQL_SERVER_PASSWORD"),
  port = Sys.getenv("CDM5_SQL_SERVER_PORT")
)

connection <- connect(connectionDetails)

cohortDatabaseSchema <- Sys.getenv("CDM5_SQL_SERVER_SCRATCH_SCHEMA")
cdmDatabaseSchema <- Sys.getenv("CDM5_SQL_SERVER_CDM_SCHEMA")

# just generate one cohort
cohortDefinitionSet <- PhenotypeLibrary::listPhenotypes() %>%
  dplyr::pull("cohortId") %>%
  PhenotypeLibrary::getPlCohortDefinitionSet() %>%
  dplyr::filter(cohortId == 54)

readr::write_file(cohortDefinitionSet$json, here::here("Febrile seizure.json"))

cohortTableNames <- CohortGenerator::getCohortTableNames(cohortTable = "pl_cohort")

# Next create the tables on the database
CohortGenerator::createCohortTables(
  connectionDetails = connectionDetails,
  cohortTableNames = cohortTableNames,
  cohortDatabaseSchema = cohortDatabaseSchema,
  incremental = FALSE
)

# Generate the cohort set
CohortGenerator::generateCohortSet(
  connectionDetails = connectionDetails,
  cdmDatabaseSchema = cdmDatabaseSchema,
  cohortDatabaseSchema = cohortDatabaseSchema,
  cohortTableNames = cohortTableNames,
  cohortDefinitionSet = cohortDefinitionSet,
  incremental = FALSE
)

CohortGenerator::getCohortCounts(
  connectionDetails = connectionDetails,
  cohortDatabaseSchema = cohortDatabaseSchema,
  cohortTable = "pl_cohort") %>%
  tibble()

We can see the cohort ends up with 18 entries (rows) and 18 subjects (people). Remeber that a person can have multiple entries in the same cohort since they can enter, exit, and then reenter at a later time.

Next I will download the cohort inclustion result table from the database.


con <- connect(connectionDetails)

# Bring the inclusion result table to R memory
inclusionResult <- querySql(con, glue::glue("select * from {cohortDatabaseSchema}.pl_cohort_inclusion_result")) %>%
  dplyr::rename_all(tolower) %>%
  dplyr::mutate(inclusion_rule_mask = as.numeric(.data$inclusion_rule_mask)) %>%
  tibble()

inclusionResult

So this table is a bit cryptic to understand but contains the information you need to build an attrition table. By summing various combination of rows in this table you can get at all the various combinations of inclusion rules. It will allow you to calculate the number of subjects and records for any combination of inclusion rules.

Now my colleague Marti wrote a nice helper function to give you the inclusion rule mask ids for each inclusion rule. And actually Marti deserves the credit for all of this investigative work and spent a lot of time trying to understand these cohort tables. The function is

getInclusionMaskId <- function(numberInclusion) {

  inclusionMaskMatrix <- dplyr::tibble(
    inclusion_rule_mask = 0:(2^numberInclusion - 1)
  )

  for (k in 0:(numberInclusion - 1)) {
    inclusionMaskMatrix <- inclusionMaskMatrix %>%
      dplyr::mutate(!!paste0("inclusion_", k) :=
                      rep(c(rep(0, 2^k), rep(1, 2^k)), 2^(numberInclusion - k - 1))
      )
  }

  lapply(-1:(numberInclusion - 1), function(x) {
    if (x == -1) {
      return(inclusionMaskMatrix$inclusion_rule_mask)
    } else {
      inclusionMaskMatrix <- inclusionMaskMatrix
      for (k in 0:x) {
        inclusionMaskMatrix <- inclusionMaskMatrix %>%
          dplyr::filter(.data[[paste0("inclusion_", k)]] == 1)
      }
      return(inclusionMaskMatrix$inclusion_rule_mask)
    }
  })
}

You pass in the number of inclusion rules you have and it will return a list of the same length with the mask ids you need to use.
Here is some example output.

Next I’ll extract the names of the inclusion rules from the cohort definition.

# get the text descriptions of the inclusion rules
cohortDefinition <- jsonlite::fromJSON(cohortDefinitionSet$json[1], simplifyVector = FALSE)
str(cohortDefinition, max.level = 1)
inclusionNames <- purrr::map_chr(cohortDefinition$InclusionRules, "name")
inclusionNames

Create a list of dataframes using the appropriate inclusion rule mask ids


numberInclusionRules <- length(inclusionNames)
inclusionMaskId <- getInclusionMaskId(numberInclusionRules)
inclusionNames <- c("Qualifying initial records", inclusionNames)
attrition <- list()

id <- 54 # cohort definition id

for (k in 1:(numberInclusionRules + 1)) {

  number_records <- inclusionResult %>%
    dplyr::filter(.data$cohort_definition_id == id) %>%
    dplyr::filter(.data$mode_id == 0) %>%
    dplyr::filter(.data$inclusion_rule_mask %in% inclusionMaskId[[k]]) %>%
    dplyr::pull("person_count") %>%
    base::sum()

  number_subjects = inclusionResult %>%
    dplyr::filter(.data$cohort_definition_id == id) %>%
    dplyr::filter(.data$mode_id == 1) %>%
    dplyr::filter(.data$inclusion_rule_mask %in% inclusionMaskId[[k]]) %>%
    dplyr::pull("person_count") %>%
    base::sum()

  attrition[[k]] <- dplyr::tibble(
    cohort_definition_id = id,
    number_records = number_records,
    number_subjects = number_subjects,
    reason_id = k,
    reason = inclusionName[k]
  )
}

Finally bind these together and compute the number dropped at each step.

attrition <- attrition %>%
  dplyr::bind_rows() %>%
  dplyr::mutate(
    excluded_records =
      dplyr::lag(.data$number_records, 1, order_by = .data$reason_id) -
      .data$number_records,
    excluded_subjects =
      dplyr::lag(.data$number_subjects, 1, order_by = .data$reason_id) -
      .data$number_subjects
  ) %>%
  dplyr::mutate(
    excluded_records = dplyr::coalesce(.data$excluded_records, 0),
    excluded_subjects = dplyr::coalesce(.data$excluded_subjects, 0)
  )

attrition

And voila! that’s all you need to do! There might be an easier way to do this but I’m not sure. And I’m sure I botched this explanation pretty good. Maybe @Chris_Knoll or @anthonysena can fill in the gaps or correct anything I got wrong.

Great question!

In the Cohort output, there are three distinct tables that complement the main cohort table. The main cohort table as you has cohort_definition_id, subject_id, cohort_start_date, cohort_end_date. The supporting tables are as follows:

  1. cohort_inclusion ("#cohort_inclusion"): This serves as the rule metadata table, containing rule names and descriptions specific to each cohort. This table helps in identifying which rule is being examined.
  2. cohort_inclusion_stats ("#cohort_inc_stats"): This is essentially the result table, containing fields such as person_count, gain_count, and person_total. The table uses cohort_definition_id as the entity and attributes like rule_sequence and mode_id.
  3. cohort_inclusion_result ("#cohort_inc_result"): Another result table, this is where the ingenuity of @Chris_Knoll becomes apparent. Here, the entity is cohort_definition_id while the attributes include mode_id and inclusion_rule_mask. The value column contains person_count.

Regarding mode_id: The attribute mode_id has two key values:

  • mode_id 0: Denotes all events
  • mode_id 1: Indicates the “best event,” which represents the single event per person meeting the highest number of inclusion criteria.

The cohort_inclusion_result uses inclusion_rule_mask to summatively represent which of the inclusion rules were met for a particular combination. For example, an inclusion_rule_mask of 3 means that rules 1 and 2 were met, and this will be represented in the person_count column for the respective mode_id values. How do I know that:

as.integer(intToBits(0)) = 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 i.e. none of max 32 inclusion rules were met.

as.integer(intToBits(3)) = 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 i.e. met rule 1 and 2.

I continue to be amazed by the brilliance of this approach, as initiated by @Chris_Knoll. It by using bitmask numbers to communicate the status of up to 32 inclusion rules efficiently.

Now we can use bit operators in very interesting ways to find count of persons who met only rule 3, or count of persons that met rules 1, 2 but not 3. etc. I heard there is some cool work in OHDSI ShinyModules to use the output to create various types of visualizations of attrition. I cant wait for that innovation.

1 Like
t