FeatureExtraction 2.0

schuemie · June 30, 2017, 1:58pm

The FeatureExtraction package is used extensively in the Methods Library as well as the PatientLevelPrediction package, so it is time to think about how we can make it more effective. We have a growing wish list:

1. Would it be possible to do without the removal of redundant covariates?
2. Could FeatureExtraction support characterization of cohorts? And could per-person statistics be skipped, for efficiency, when they are not needed?
3. A more flexible approach to defining the time window in which covariates are captured.
4. More flexibility in adding and maintaining features.

Leaving the first wish aside for now, I’ve come up with a new architecture that should address the remaining issues.

At the core of the proposed architecture is a set of SQL files like this, this, and this. These files are a bit complex, but the advantage is that the same code can generate both per-person statistics and aggregated statistics for the entire cohort, and both regular covariates (one time period per analysis ID) and temporal covariates (multiple time periods per analysis ID). Even for the regular covariates the time period is parameterized. By having all this functionality in one place, we can more easily guarantee consistency.

The new covariate settings objects will now just be a data frame with references to these SQL files, for example:

Such settings can be generated using the new createCovariateSettings function which can be used just as the old one, or people can generate this data frame ‘manually’ for more flexibility.

The new getDbCovariates function now has arguments aggregated and temporal to switch between the type of covariate data to create.
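
As a rough illustration of the difference between the two output types, here is a minimal sketch in base R. It does not call FeatureExtraction; the data, the cohort size, and the column names are assumptions made up for the example.

```r
# Hypothetical per-person output: one row per person per covariate present.
perPerson <- data.frame(
  rowId = c(1, 2, 2, 3),
  covariateId = c(1001, 1001, 2001, 2001),
  covariateValue = c(1, 1, 1, 1)
)

# Assumed cohort size: a fourth person has none of the covariates.
cohortSize <- 4

# Aggregated output: one row per covariate, summarising the whole cohort.
sums <- aggregate(covariateValue ~ covariateId, data = perPerson, FUN = sum)
aggregated <- data.frame(
  covariateId = sums$covariateId,
  sumValue = sums$covariateValue,
  averageValue = sums$covariateValue / cohortSize
)
print(aggregated)
```

When only cohort characterization is needed, the aggregated form avoids materialising a row for every person, which is the efficiency gain mentioned in the wish list.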

Adding custom covariates is now easier because one can reference additional SQL files when calling FeatureExtraction. I would also leave it to users to ensure there are no conflicts in analysis IDs and covariate IDs (rather than having FeatureExtraction resolve conflicts automatically), thereby increasing reproducibility.

Looking forward to everyone’s thoughts on this, including @Patrick_Ryan, @Rijnbeek, @jennareps, and @Frank.

Eldar · July 1, 2017, 1:00pm

Can you please explain the difference between regular and temporal covariates?
As I understood it, for any condition_occurrence we could create several covariate_ids with different window periods (30, 180, 365, all time prior, etc.). What has changed now?

schuemie · July 2, 2017, 6:46am

What I call ‘temporal covariates’ is simply for when you want lots of time windows (e.g. one for every day in the year prior). In that case, we add a timeId column to the covariate table, and use the same covariate ID across all time windows. There is a class of prediction algorithms for which that makes sense.
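
To make the distinction concrete, here is a minimal sketch in base R of how the two layouts might look. Only the timeId column comes from the description above; the other column names, the concept ID, and the analysis IDs are illustrative assumptions. The ID scheme (concept ID times 1000 plus analysis ID) follows the SQL quoted later in this thread.

```r
# Hypothetical toy data: one person with a condition recorded on two
# different days before cohort start.

# Regular covariates: each time window has its own analysis ID and hence
# its own covariate ID (assumed scheme: conceptId * 1000 + analysisId).
regular <- data.frame(
  rowId = c(1, 1),
  covariateId = c(201826 * 1000 + 101, 201826 * 1000 + 102),
  covariateValue = c(1, 1)
)

# Temporal covariates: one covariate ID reused across all windows, with an
# extra timeId column identifying the window (e.g. one per day).
temporal <- data.frame(
  rowId = c(1, 1),
  covariateId = rep(201826 * 1000 + 101, 2),
  timeId = c(3, 17),
  covariateValue = c(1, 1)
)

length(unique(regular$covariateId))   # one ID per window
length(unique(temporal$covariateId))  # a single ID shared by all windows
```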

Eldar · July 4, 2017, 4:12pm

Now clear, thank you!
‘Adding custom covariates is now easier because one can reference additional SQL’
This is a really interesting idea. This way it will also be possible to find covariates that are probably important and then implement them as default ones.

schuemie · July 5, 2017, 8:09am

Indeed! The idea is to allow the community to contribute to an ever growing set of available covariates. I think I would still like to define a default set which may not include all, but users will be able to ‘mix and match’ depending on their needs.

Gowtham_Rao · August 25, 2017, 1:47am

@schuemie this is awesome. Have you thought about how we should work on contributing code and also creating new analysis_id and covariate_id values without causing conflicts?

schuemie · August 25, 2017, 2:38pm

Yes, good question.

First, just to be clear: version 2 is still under development, so it is not possible to add analyses at this time. I will announce when we launch version 2.

I would imagine adding a new feature would roughly follow these steps:

1. Announce the idea for the new feature on the issue tracker.
2. If needed, implement and add a new template SQL file such as this one. Given the complexity of these files, I would imagine you’d need help with that.
3. Add the analysis to the list of prespecified analyses, manually picking a unique analysis ID to prevent collisions.

Note that for features that do not fit the pattern of template SQL + parameters there is always still the option to create custom covariate builders as documented in the package vignettes.
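
The manual uniqueness check in the last step could be sketched in base R as follows. This is only an illustration: the table of analyses, the IDs, the names, and the addAnalysis helper are all hypothetical and not part of FeatureExtraction.

```r
# Hypothetical table of prespecified analyses (IDs and names are made up).
existing <- data.frame(
  analysisId = c(101, 102, 201),
  analysisName = c("ConditionEraStart", "ConditionEraOverlap", "DrugEraStart"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: refuse to add an analysis whose ID is already taken.
addAnalysis <- function(analyses, analysisId, analysisName) {
  if (analysisId %in% analyses$analysisId) {
    stop("Analysis ID ", analysisId, " is already in use")
  }
  rbind(analyses,
        data.frame(analysisId = analysisId,
                   analysisName = analysisName,
                   stringsAsFactors = FALSE))
}

# A new ID is accepted; a reused ID raises an error instead of being
# silently renumbered.
updated <- addAnalysis(existing, 301, "MyNewFeature")
collision <- tryCatch(addAnalysis(existing, 101, "Duplicate"),
                      error = function(e) conditionMessage(e))
```

Failing loudly on a collision, rather than resolving it automatically, keeps IDs stable across runs, which is the reproducibility argument made in the first post.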

schuemie · October 23, 2017, 12:28pm

Version 2 is now close to ready for release. I just created a first version of the vignette, and would appreciate any feedback.

@Gowtham_Rao: if you have analyses you would like to add I think it is safe to do so now using the process I described earlier.

Patrick_Ryan · October 23, 2017, 2:02pm

Great work, @schuemie. This is an important step forward that will directly support the community’s characterization, estimation, and prediction efforts.

An important feature here, that was not in the prior version without creating custom covariates, was actually the subject of discussion in the Patient-level Prediction tutorial last Friday. The prediction we worked on as a group was: Amongst T=persons admitted to hospital with pneumonia, which patients were O=required to have service in the ICU during TAR=the same inpatient episode? We initially ran this model with all baseline covariates, but since the prior version used day 0 as part of the covariate time window, we observed diagnoses that were likely part of the ICU sequence, rather than prior comorbidities. Having the ability to set the time windows, as you show in the vignette at the bottom on page 2, createCovariateSettings, is exactly the solution we’d require; by setting endDays = -1, we’d limit ourselves to only baseline covariates observed prior to the hospitalization start.
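
The effect of endDays = -1 can be sketched in base R. The inWindow helper, its startDays argument, and the dates below are assumptions for illustration only; this is not how the package itself evaluates windows.

```r
# Hypothetical helper: is an event inside the covariate window, where
# startDays and endDays are relative to the cohort start date
# (negative = before, 0 = the index day itself)?
inWindow <- function(eventDate, cohortStartDate, startDays, endDays) {
  offset <- as.numeric(as.Date(eventDate) - as.Date(cohortStartDate))
  offset >= startDays & offset <= endDays
}

admission <- "2017-10-20"  # cohort start (day 0), made-up date
events <- c("2017-03-01", "2017-10-19", "2017-10-20")

# Window ending on day 0 keeps the diagnosis made on the admission day.
withDayZero <- inWindow(events, admission, startDays = -365, endDays = 0)

# Window ending on day -1 drops it, leaving only prior comorbidities.
withoutDayZero <- inWindow(events, admission, startDays = -365, endDays = -1)
```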

Chris_Knoll · October 23, 2017, 2:52pm

@schuemie,
Would you like requests posted here, or new git issues created for requests?

schuemie · October 23, 2017, 3:32pm

Please post requests in the GitHub issue tracker.

Gowtham_Rao · October 24, 2017, 10:18am

@schuemie - will get on this ASAP!

This is awesome. And thank you!

SCYou · October 25, 2017, 9:18am

Thanks for the wonderful job, @schuemie.
I have three questions.

1. Can you tell me roughly when this cool version of the FeatureExtraction package will be released? My current work is very dependent on this package (especially for temporal features).
2. Recently, we’ve developed a genetic CDM compatible with the existing OMOP CDM (three tables for genetic information were added in our model). Now we’re working on building some prediction algorithms using combined clinical and genetic data in the CDM. I’m thinking about using the FeatureExtraction package with a custom covariate builder to extract genetic information from these newly added tables and columns. Is it possible to extract information from a user-defined table in the CDM using FeatureExtraction version 2?
3. In the previous temporal feature branch, I found that the package extracted information after cohort_start_date when I used negative start/end days, whereas the expected behavior was to extract features before cohort_start_date (I had to use positive start/end days because of this issue). Is this problem fixed in the new version?

The example code from the temporal feature branch is below:

SQL code:
-- condition era starts in time period
SELECT DISTINCT cp1.@row_id_field AS row_id,
    CAST(ce1.condition_concept_id AS BIGINT) * 1000 + 101 AS covariate_id,
    tp1.time_id AS time_id,
    1 AS covariate_value
INTO #cov_co_start
FROM @cohort_temp_table cp1
INNER JOIN @cdm_database_schema.condition_era ce1
    ON cp1.subject_id = ce1.person_id
INNER JOIN #time_period tp1
    ON DATEDIFF(DAY, ce1.condition_era_start_date, cp1.cohort_start_date) <= tp1.end_day
        AND DATEDIFF(DAY, ce1.condition_era_start_date, cp1.cohort_start_date) >= tp1.start_day
WHERE ce1.condition_concept_id != 0
{@has_excluded_covariate_concept_ids} ? { AND ce1.condition_concept_id NOT IN (SELECT concept_id FROM #excluded_cov)}
{@has_included_covariate_concept_ids} ? { AND ce1.condition_concept_id IN (SELECT concept_id FROM #included_cov)}
;

Thank you for all your contributions!
Chan

schuemie · October 25, 2017, 11:14am

Hi @SCYou,

To answer your questions:

1. We are in the process of finalizing the package and dependencies (CohortMethod and PatientLevelPrediction). I think we should be ready in the next two weeks, at which time we will cut over all three packages at once.
2. FeatureExtraction is designed to work against the CDM. So the cleanest path forward would be to get your additions for genetic data into the CDM specifications, which you will have to discuss on the CDM issue tracker. Alternatively, you could write a custom covariate builder for now.
3. Yes, for consistency all ‘days’ arguments are now relative to the cohort_start_date, so negative means going back in time and positive means going forward in time.
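
The sign issue can be illustrated with a small base R sketch using made-up dates. It mirrors the date arithmetic only and is not the package's implementation: DATEDIFF(DAY, a, b) in SQL Server returns b minus a, so the expression in the quoted SQL is positive for events before the cohort start, which is the opposite of the convention described above.

```r
cohortStart <- as.Date("2017-10-20")
eraStart <- as.Date("2017-10-10")  # ten days before cohort start

# Mirrors DATEDIFF(DAY, condition_era_start_date, cohort_start_date) from
# the quoted SQL: positive when the era starts before the cohort start.
oldOffset <- as.numeric(cohortStart - eraStart)

# Convention described above: days relative to cohort_start_date, so an
# event before the index date gets a negative value.
newOffset <- as.numeric(eraStart - cohortStart)

c(old = oldOffset, new = newOffset)
```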

SCYou · October 25, 2017, 11:29am

@schuemie Thank you for answering my questions. It’s really helpful!

schuemie · October 30, 2017, 3:04pm

Note: We will release FeatureExtraction 2.0 this week (probably Wednesday). At the same time we will release new versions of PatientLevelPrediction and CohortMethod, as the current versions of those packages are not entirely compatible with the new version of FeatureExtraction.

For those of you who still depend on the current versions of those packages, note that at any time you can install older versions of OHDSI packages through drat. For example, if at some point in the future you would like to install the current version of FeatureExtraction, you can use:

library(devtools)
devtools::install_url("https://github.com/OHDSI/drat/raw/gh-pages/src/contrib/FeatureExtraction_1.2.3.tar.gz")

Similarly, you can install the current versions of PatientLevelPrediction or CohortMethod using:

devtools::install_url("https://github.com/OHDSI/drat/raw/gh-pages/src/contrib/CohortMethod_2.4.4.tar.gz")

devtools::install_url("https://github.com/OHDSI/drat/raw/gh-pages/src/contrib/PatientLevelPrediction_1.2.2.tar.gz")

(Looping in @jennareps, @Rijnbeek, @jweave17)

schuemie · November 14, 2017, 10:08am

FeatureExtraction V2.0.0 has now been released!

Please see the vignette for details on how to use the new version.

Note that new versions of CohortMethod, CaseControl, and PatientLevelPrediction will also be released today for compatibility with the new version of FeatureExtraction…

SCYou · November 14, 2017, 3:05pm

Thank you for your contribution @schuemie!
I love the ‘createTable’ function!! I want to modify my hypertension combination package to use this function, too.

Great job!!

Gowtham_Rao · July 1, 2018, 1:38pm

@schuemie could you please help us understand how to use createCovariateSettings or a custom covariate builder when there is more than one period per subject_id?

rowIdField: The name of the field in the cohort temp table that is to be used as the row_id field in the output table. This can be especially useful if there is more than one period per person.

[From vignette][1]

getDbLooCovariateData <- function(connection,
                                  oracleTempSchema = NULL,
                                  cdmDatabaseSchema,
                                  cohortTable = "#cohort_person",
                                  cohortId = -1,
                                  cdmVersion = "5",
                                  rowIdField = "subject_id",
                                  covariateSettings,
                                  aggregated = FALSE) {
  writeLines("Constructing length of observation covariates")
  if (covariateSettings$useLengthOfObs == FALSE) {
    return(NULL)
  }
  if (aggregated)
    stop("Aggregation not supported")

  # Some SQL to construct the covariate:
  sql <- paste("SELECT @row_id_field AS row_id, 1 AS covariate_id,",
               "DATEDIFF(DAY, observation_period_start_date, cohort_start_date)",
               "AS covariate_value",
               "FROM @cohort_table c",
               "INNER JOIN @cdm_database_schema.observation_period op",
               "ON op.person_id = c.subject_id",
               "WHERE cohort_start_date >= observation_period_start_date",
               "AND cohort_start_date <= observation_period_end_date",
               "{@cohort_id != -1} ? {AND cohort_definition_id = @cohort_id}")

How can we specify that rowIdField should be the combination of subject_id and cohort_start_date?
[1]: https://github.com/OHDSI/FeatureExtraction/blob/master/vignettes/CreatingCustomCovariateBuilders.Rmd

schuemie · July 2, 2018, 9:45am

Hi @Gowtham_Rao. If the subject_id field in your cohort table is not unique (for the cohort of interest), you’ll need to construct a column with values that are unique, and point FeatureExtraction to that column.

The way I usually do this is by using the ROW_NUMBER function in SQL, as you can see in this SQL in CohortMethod. Later, if you want, you can merge the features created by FeatureExtraction with the cohort table on the row_id column to link back to specific subjects and cohort start dates.
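
The same idea can be sketched in base R on toy data. The cohort rows and covariate values are made up, and seq_len over the sorted table stands in for ROW_NUMBER() with a matching ORDER BY; in practice the row_id would be created in SQL on the cohort table itself.

```r
# Hypothetical cohort table in which subject 1 has two entries.
cohort <- data.frame(
  subject_id = c(1, 1, 2),
  cohort_start_date = as.Date(c("2017-01-10", "2017-06-05", "2017-03-15"))
)

# Analogue of ROW_NUMBER() OVER (ORDER BY subject_id, cohort_start_date):
# a value that is unique per subject_id and cohort_start_date combination.
cohort <- cohort[order(cohort$subject_id, cohort$cohort_start_date), ]
cohort$row_id <- seq_len(nrow(cohort))

# Hypothetical covariates returned per row_id (values are made up).
covariates <- data.frame(
  row_id = c(1, 2, 3),
  covariate_id = c(1, 1, 1),
  covariate_value = c(40, 186, 75)
)

# Link each covariate back to its subject and cohort start date.
linked <- merge(covariates, cohort, by = "row_id")
print(linked)
```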