Interpreting pathway analysis tables under results schema

Chris_Knoll · March 2, 2020, 4:11pm

Yes, the pathway_analysis_stats table is simply a summary statistics table. The pathway_analysis_generation_id is an identifier generated by our ‘batch job’ subsystem, which manages all of our background tasks (cohort generation, characterization, incidence rates, pathway analysis). The reason you see gaps in this table is that the same sequencer is used across the different generation types, so the missing values you see were used by some other generation (like cohort generation).

This table is a lookup table that gives you the name of each event cohort or combination of event cohorts. You join its code column with the combo_id column of pathway_analysis_events. The is_combo flag tells you that the code represents a combination of multiple event cohort codes, so you can easily tell the stand-alone event cohorts apart from the combination cohorts.

These are defined as follows:
The event cohorts are ordered, and indexed starting at 0.
The stand-alone event cohort codes are calculated as POWER(2, index), i.e. 1,2,4,8,16,32…
The cohort events are constructed and split up to determine the overlapping periods. See this post for details on that.
To ‘combine’ the event cohorts into a new combo_id, we SUM(combo_id) grouped by person_id, start_date, end_date. This adds up the distinct power-of-two combo IDs, so with 2 event cohorts you get the following combo IDs:

Combo Example
comboId	Event Cohort 1	Event Cohort 2	Combo Name
1	Yes	No	Event Cohort 1
2	No	Yes	Event Cohort 2
3	Yes	Yes	Event Cohort 1 + Event Cohort 2

If you work this out in binary, Event Cohort 1 is 01 and Event Cohort 2 is 10; adding them together gives 1+2 = 3 = 11 (in binary). We leverage this property of binary addition to create the combos.
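To make the arithmetic concrete, here is a minimal Python sketch of the same idea. It is not the WebAPI code (that logic runs in SQL inside WebAPI); the function names and sample rows are made up for illustration, and it assumes each event cohort contributes at most one row per person and period, so that summing never counts the same power of two twice.

```python
from collections import defaultdict


def cohort_code(index):
    """Code for the stand-alone event cohort at a 0-based index: 1, 2, 4, 8, ..."""
    return 2 ** index


def combine(rows):
    """Sum combo_id per (person_id, start_date, end_date).

    rows: iterable of (person_id, start_date, end_date, combo_id) tuples,
    one per event cohort active in that period (hypothetical layout).
    """
    totals = defaultdict(int)
    for person_id, start_date, end_date, combo_id in rows:
        totals[(person_id, start_date, end_date)] += combo_id
    return dict(totals)


def decode(combo_id):
    """Return the 0-based event cohort indexes whose bit is set in combo_id."""
    return [i for i in range(combo_id.bit_length()) if (combo_id >> i) & 1]


# Person 1 is in both event cohorts for the first period, then only in the second.
rows = [
    (1, "2020-01-01", "2020-01-10", cohort_code(0)),
    (1, "2020-01-01", "2020-01-10", cohort_code(1)),
    (1, "2020-01-10", "2020-02-01", cohort_code(1)),
]
combined = combine(rows)
```

Here `combined` maps the first period to 3 (both cohorts) and the second to 2, and `decode(3)` gives back the indexes `[0, 1]`, which is the reverse lookup you would otherwise do through the lookup table.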

This is the raw pathway event table, which tells you, for a given person, which event cohorts appeared in which order and what combinations were present at the time. You can use it to filter on specific people and combos.

The pathway_analysis_paths table simply takes the data from pathway_analysis_events and makes a ‘wide’ table (up to a path length of 10). This is just for simplicity of retrieving the data for the analysis. On large cohorts, this is actually quite expensive, so we build this ‘report-ready’ table from the raw events. You are correct that if your maximum path length is 3, then only the columns up to step_3 will be populated.
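As a rough illustration of that long-to-wide step, here is a small Python sketch. It is only an approximation under stated assumptions: the input layout (person_id, a 1-based ordinal, combo_id) and the per-path person count are my simplifications for the example, not the actual table definitions or the SQL that WebAPI runs.

```python
from collections import Counter, defaultdict

MAX_STEPS = 10  # the wide table holds up to 10 steps


def build_paths(events):
    """Pivot long event rows into wide paths and count people per path.

    events: iterable of (person_id, ordinal, combo_id), where ordinal is the
    1-based position of the event in that person's pathway (assumed layout).
    Returns a Counter keyed by a 10-tuple (step_1 .. step_10); unused steps
    are None.
    """
    per_person = defaultdict(dict)
    for person_id, ordinal, combo_id in events:
        if ordinal <= MAX_STEPS:  # anything past step 10 is dropped
            per_person[person_id][ordinal] = combo_id

    counts = Counter()
    for steps in per_person.values():
        path = tuple(steps.get(i) for i in range(1, MAX_STEPS + 1))
        counts[path] += 1
    return counts


# Two people go 1 -> 3, one person only ever has combo 2.
events = [
    (1, 1, 1), (1, 2, 3),
    (2, 1, 1), (2, 2, 3),
    (3, 1, 2),
]
paths = build_paths(events)
```

With these sample events only the first two positions of any path are filled and the remaining ones stay empty, which mirrors what you observed with the later step columns being unpopulated.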

There is no R package to execute this, since all this functionality is bundled together with the WebAPI Java package. However, it wouldn’t be unreasonable to split off the pathway analysis functionality into a stand-alone Java library, which could be invoked from R or Java (i.e., as a dependency for WebAPI).