OHDSI Community call, today, 3Mar2015

Patrick_Ryan · March 3, 2015, 1:46pm

Team:

On today’s call, we’ll discuss the current state and potential future directions in the use of the COHORT, COHORT_DEFINITION (and by extension, COHORT_ATTRIBUTE and COHORT_ATTRIBUTE_DEFINITION) tables.

There’s been a lot of exciting progress from the CIRCE working group and the HERACLES working group, both of which are heavily relying on the existence of a standardized data structure to represent patients that satisfy a set of inclusion criteria for a duration of time. Our COHORT table, as defined in OMOP CDM v4 and CDM v5 specifications, is that standardized data structure, and is efficiently represented as: COHORT_ID, SUBJECT_ID, COHORT_START_DATE and COHORT_END_DATE. CIRCE (latest UI available at: http://ohdsi.org/web/circe) is a standardized tool to provide a user interface, standardized syntax to present cohort definitions, and a compiler to produce platform-independent SQL to instantiate the cohort. HERACLES takes a cohort as an input to produce standardized summary statistics about the cohort, in much the same way that ACHILLES produces standardized summary statistics about the entire database.
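To make the four-column structure concrete, here is a minimal sketch using an in-memory SQLite database. The column names follow the list above; the sample rows, the ISO-string date storage, and the helper `subjects_in_cohort_on` are assumptions for illustration only, not part of the CDM specification or of any OHDSI tool.

```python
import sqlite3

# In-memory database standing in for a CDM or application schema (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE cohort (
        cohort_id         INTEGER NOT NULL,
        subject_id        INTEGER NOT NULL,
        cohort_start_date TEXT    NOT NULL,  -- ISO 8601 date string
        cohort_end_date   TEXT    NOT NULL   -- ISO 8601 date string
    )
    """
)

# Made-up rows: subject 101 belongs to two different cohorts.
rows = [
    (1, 101, "2010-01-01", "2012-06-30"),
    (1, 102, "2011-03-15", "2011-12-31"),
    (2, 101, "2013-01-01", "2013-12-31"),
]
conn.executemany("INSERT INTO cohort VALUES (?, ?, ?, ?)", rows)


def subjects_in_cohort_on(conn, cohort_id, iso_date):
    """Return subject_ids whose cohort era covers iso_date (hypothetical helper)."""
    cur = conn.execute(
        """
        SELECT subject_id
        FROM cohort
        WHERE cohort_id = ?
          AND cohort_start_date <= ?
          AND cohort_end_date >= ?
        ORDER BY subject_id
        """,
        (cohort_id, iso_date, iso_date),
    )
    return [row[0] for row in cur]


print(subjects_in_cohort_on(conn, 1, "2011-06-01"))
```

ISO 8601 date strings sort lexicographically in date order, which is why the plain string comparison in the query works here.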

Beyond the standardized applications, we know several of you in the community are building your own cohorts for other purposes. @amatcho provided a good example within CPRD for how to standardize HES data, of which only a subset of people qualify for a shorter period of time than their overall observation period. Our colleagues at Erasmus have been considering how to apply a new cohort to define the subset of time in which they have confidence in the research-readiness of the data. Others have been thinking about how to pre-define cohorts as part of their ETLs for diseases that we all know we need to use over and over again, like diabetes, etc.

A question that has cropped up is where these data structures should reside, and how our analyses should accommodate the location(s). On the one side, storing everything in the CDM keeps patient-level data consolidated. On the flip side, some don’t have write permissions to the CDM, and given the more dynamic nature in which these data structures may be used, it could be worthwhile to store a copy of the objects in the OHDSI application schema. Given that this could have material consequences for future standardized application development as well as ETL conventions, I think it’d be a useful conversation to raise with the group to hear everyone’s perspectives.

As time permits, other topics with ongoing activity:

Registration is now closed for the OHDSI F2F meeting at Stanford. We should have a productive session with the ~25 folks who will be in attendance; it’ll be good to roll up our sleeves and dive into some hard topics.
AMIA Annual Conference submissions: there are threads around a Systems Demonstration and a paper on the OHDSI infrastructure. A couple of weeks are left before that deadline.

Cheers,

Patrick

amatcho · March 3, 2015, 7:56pm

SEER-Medicare data provides another good use case for retaining the cohort table in the CDM structure. SEER-Medicare provides data cuts by the cancer type requested under an approved protocol. In an ETL currently in development, I’m planning on using the cohort table to identify which cancer cohort a patient belongs to. So this ‘cancer-type’ cohort is not based on overlapping observation periods but is actually a disease-based cohort.

lee_evans · March 3, 2015, 9:39pm

I had a few thoughts to add, related to the interesting cohort and cohort definitions discussion today.

It was great that the difference between a cohort (group of people) and a cohort definition (specification of a cohort) was called out in the meeting today. I feel it’s an important distinction in this context.

A cohort definition can be shared/re-used and applied to multiple data sets to generate the resulting cohort. The resulting cohort is tied to the data set where the cohort definition was applied.
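The distinction can be sketched in a few lines: one definition object, applied to two data sets, yields two different cohorts. Everything here is hypothetical, including the `CohortDefinition` and `CohortRow` classes, the single-condition criterion, and the sample records; real definitions (for example those produced by CIRCE) are far richer.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CohortDefinition:
    """The shareable specification (hypothetical, one inclusion criterion)."""
    cohort_id: int
    name: str
    condition: str


@dataclass(frozen=True)
class CohortRow:
    """One row of the resulting cohort, tied to the data set it came from."""
    cohort_id: int
    subject_id: int
    cohort_start_date: date
    cohort_end_date: date


def generate_cohort(definition, records):
    """Apply a definition to (person_id, condition, start, end) records."""
    return [
        CohortRow(definition.cohort_id, person_id, start, end)
        for person_id, condition, start, end in records
        if condition == definition.condition
    ]


diabetes = CohortDefinition(cohort_id=1, name="Diabetes", condition="diabetes")

# Two made-up data sets at different sites.
site_a = [
    (1, "diabetes", date(2010, 1, 1), date(2012, 1, 1)),
    (2, "asthma", date(2011, 1, 1), date(2011, 6, 1)),
]
site_b = [
    (7, "diabetes", date(2009, 5, 1), date(2010, 5, 1)),
    (8, "diabetes", date(2012, 2, 1), date(2013, 2, 1)),
]

# Same definition, different resulting cohorts.
cohort_a = generate_cohort(diabetes, site_a)
cohort_b = generate_cohort(diabetes, site_b)
print(len(cohort_a), len(cohort_b))
```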

A cohort definition can be defined, extracted/exported, transported and imported in some cohort definition domain specific language in e.g. XML/JSON or as SQL.

Cohorts could be manually created via SQL for a specific study, created (perhaps ephemerally) by an individual OHDSI tool for visualization, used as an indirect mechanism of data transfer between tools, or (re)-generated automatically on a periodic basis as part of an ETL process. I think these multiple paths for creation and use of cohorts are an important aspect of this discussion.

I noted down a few scenarios for sharing/re-using cohort definitions with differing scope:

‘Gold standard’ OHDSI cohort definitions - shared across the whole OHDSI network, (with an OHDSI release schedule for updates?). An OHDSI phenotype library may be a good candidate.
OHDSI network study cohort definitions - shared per OHDSI network study with the participating OHDSI study researchers (today I believe these definitions are embodied within the network study protocol R/SQL scripts)
Organizational cohort definitions - shared across multiple researchers/departments within an individual organization
Internal organizational study cohort definition - used by a study team within an organization, specific to an internal study and archived along with the study results for reproducibility.
Individual researcher cohort definitions - used just by a single researcher

Appropriate access/security model and ‘OHDSI network/organization/individual namespace’ considerations are needed to manage the separate scope of each scenario

jon_duke · March 3, 2015, 10:07pm

Amy, this is a good example for the cohort table. Would you store the cancer cohort_definitions in the CDM as well? Per Chris/Frank, the CDMs are wiped clean when the data are refreshed, so this would seem to be a risky place for maintaining the definitions.

Patrick_Ryan · March 3, 2015, 10:15pm

Yes, I think we would, because the definitions would be part of our ETL package.

Mark_Danese · March 3, 2015, 10:22pm

I think this is a great idea.

The potential problem on SEER Medicare data is that people can be in more than 1 cohort. A person can have multiple primary tumors and be in the (for example) breast and colorectal cancer cohorts. So, if they are ETL’d separately with a cohort indicator in the cohort table, that person’s records will be duplicated (assuming similar time periods on the data request). So, one would need to do some de-duplication as part of the ETL process to make sure that person doesn’t, as an extreme example, die twice. (People have a surprising number of primary cancers – this is not as rare an event as one might assume.)

The SEER Medicare data are provided tumor-by-tumor. However, it may be that if you request multiple tumors together, there is no duplication. I need to check on that.
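A rough sketch of that de-duplication step, assuming each tumor-specific extract arrives as a list of records keyed by a shared person identifier. The record layout, the field names, and the `load_extracts` helper are all hypothetical; the actual SEER-Medicare file formats are not represented here.

```python
def load_extracts(extracts):
    """Load per-cohort extracts, keeping one person and one death record per person.

    extracts maps a cohort_id to a list of dicts with hypothetical keys:
    person_id, year_of_birth, death_date (or None), start, end.
    """
    persons = {}      # person_id -> person record, loaded only once
    deaths = {}       # person_id -> death date, loaded only once
    cohort_rows = []  # one row per (cohort, person) membership

    for cohort_id, records in extracts.items():
        for rec in records:
            pid = rec["person_id"]
            # setdefault keeps the first record seen and ignores repeats.
            persons.setdefault(
                pid, {"person_id": pid, "year_of_birth": rec["year_of_birth"]}
            )
            if rec.get("death_date") is not None:
                deaths.setdefault(pid, rec["death_date"])
            # Cohort membership is intentionally NOT de-duplicated.
            cohort_rows.append((cohort_id, pid, rec["start"], rec["end"]))

    return persons, deaths, cohort_rows


# Made-up example: person 1 appears in both the breast and colorectal extracts.
extracts = {
    "breast": [
        {"person_id": 1, "year_of_birth": 1940, "death_date": "2009-04-02",
         "start": "2001-03-01", "end": "2009-04-02"},
        {"person_id": 2, "year_of_birth": 1952, "death_date": None,
         "start": "2004-07-15", "end": "2010-12-31"},
    ],
    "colorectal": [
        {"person_id": 1, "year_of_birth": 1940, "death_date": "2009-04-02",
         "start": "2006-11-20", "end": "2009-04-02"},
    ],
}

persons, deaths, cohort_rows = load_extracts(extracts)
print(len(persons), len(deaths), len(cohort_rows))
```

Person 1 ends up with two cohort rows but a single person record and a single death record, which is the behaviour the ETL would need.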

Patrick_Ryan · March 3, 2015, 10:32pm

Yes, it’s totally fine for a person to belong to multiple cohorts, or to belong to the same cohort multiple times. The ETL would have to ensure the data for the other tables are not redundantly loaded, but I don’t see that as being an obstacle.

Cheers,

Patrick