FeatureExtraction 2.0

Indeed! The idea is to allow the community to contribute to an ever-growing set of available covariates. I think I would still like to define a default set, which may not include everything, but users will be able to ‘mix and match’ depending on their needs.

@schuemie this is awesome. Have you thought about how we should contribute code and create new analysis_id and covariate_id values without causing conflicts?

Yes, good question.

First, just to be clear: version 2 is still under development, so it is not possible to add analyses at this time. I will announce when we launch version 2.

I would imagine adding a new feature would roughly follow these steps:

  1. Announcing the idea for the new feature on the issue tracker.
  2. If needed, implementing and adding a new template SQL file such as this one. Given the complexity of these files, I would imagine you’d need help with that.
  3. Adding the analysis to the list of prespecified analyses, manually picking a unique analysis ID to prevent collisions.

Note that for features that do not fit the pattern of template SQL + parameters there is always still the option to create custom covariate builders as documented in the package vignettes.
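The collision check in step 3 can be sketched as follows. This is a minimal illustration in Python, not part of FeatureExtraction: the function name `pick_analysis_id` and the IDs in the example are hypothetical, and the actual list of prespecified analyses lives in the package.

```python
def pick_analysis_id(proposed_id, existing_ids):
    """Return proposed_id if it is free, otherwise raise ValueError.

    existing_ids is any iterable of analysis IDs already in use
    (hypothetical values here, not the package's actual list).
    """
    taken = set(existing_ids)
    if proposed_id in taken:
        raise ValueError(f"analysis ID {proposed_id} is already in use")
    return proposed_id


# Hypothetical IDs standing in for the prespecified analyses.
existing = [1, 2, 101, 102]
new_id = pick_analysis_id(103, existing)
```

Running such a check before opening a pull request would catch an ID clash early, although the final choice would still be agreed on the issue tracker.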

Version 2 is now close to ready for release. I just created a first version of the vignette, and would appreciate any feedback.

@Gowtham_Rao: if you have analyses you would like to add I think it is safe to do so now using the process I described earlier.

Great work @schuemie . This is an important step forward that will
directly support the community’s characterization, estimation, and
prediction efforts.

An important feature here, which the prior version offered only by
creating custom covariates, was actually the subject of discussion in the
Patient-level Prediction tutorial last Friday. The prediction we worked on
as a group was: Amongst T=persons admitted to hospital with pneumonia,
which patients were O=required to have service in the ICU during TAR=the
same inpatient episode? We initially ran this model with all baseline
covariates, but since the prior version used day 0 as part of the covariate
time window, we observed diagnoses that were likely part of the ICU
sequence, rather than prior comorbidities. Having the ability to set the
time windows, as you show in the vignette at the bottom on page 2,
createCovariateSettings, is exactly the solution we’d require; by setting
endDays = -1, we’d limit ourselves to only baseline covariates observed
prior to the hospitalization start.

@schuemie,
Would you like requests posted here, or new git issues created for requests?

Please post requests in the GitHub issue tracker.

@schuemie - will get on this ASAP!

This is awesome. And thank you!

Thanks for the wonderful job, @schuemie.
I have three questions.

  1. Can you tell me roughly when this cool version of the FeatureExtraction package will be released? My current work is very dependent on this package (especially for temporal features).

  2. Recently, we’ve developed a genetic CDM compatible with the existing OMOP CDM (three tables for genetic information were added in our model). Now we’re working on building some prediction algorithms using combined clinical and genetic data in the CDM. I’m thinking about using the FeatureExtraction package with a custom covariate builder to extract genetic information from these newly added tables and columns. Is it possible to extract information from a user-defined table in the CDM using FeatureExtraction version 2?

  3. In the previous temporal feature branch, I found that the package extracted information after cohort_start_date when I used negative start/end days, whereas the expected behavior was to extract features before cohort_start_date (I had to use positive start/end days because of this issue). Is this problem fixed in the new version?

The example code from the temporal feature branch is below:

SQL code
-- condition era starts in time period
SELECT DISTINCT cp1.@row_id_field AS row_id,
CAST(ce1.condition_concept_id AS BIGINT) * 1000 + 101 AS covariate_id,
tp1.time_id AS time_id,
1 AS covariate_value
INTO #cov_co_start
FROM @cohort_temp_table cp1
INNER JOIN @cdm_database_schema.condition_era ce1
ON cp1.subject_id = ce1.person_id
INNER JOIN #time_period tp1
ON DATEDIFF(DAY, ce1.condition_era_start_date, cp1.cohort_start_date) <= tp1.end_day
AND DATEDIFF(DAY, ce1.condition_era_start_date, cp1.cohort_start_date) >= tp1.start_day
WHERE ce1.condition_concept_id != 0
{@has_excluded_covariate_concept_ids} ? { AND ce1.condition_concept_id NOT IN (SELECT concept_id FROM #excluded_cov)}
{@has_included_covariate_concept_ids} ? { AND ce1.condition_concept_id IN (SELECT concept_id FROM #included_cov)}
;

Thank you for all your contributions!
Chan

Hi @SCYou,

To answer your questions:

  1. We are in the process of finalizing the package and dependencies (CohortMethod and PatientLevelPrediction). I think we should be ready in the next two weeks, at which time we will cut over all three packages at once.

  2. FeatureExtraction is designed to work against the CDM. So the cleanest path forward would be to get your additions for genetic data in the CDM specifications, which you will have to discuss on the CDM issue tracker. Alternatively, you could write a custom covariate builder for now.

  3. Yes, for consistency all ‘days’ arguments are now relative to the cohort_start_date, so negative means going back in time and positive means going forward in time.
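The sign convention in point 3 can be illustrated with a small sketch. This is plain Python, not package code, and `day_offset` and `in_window` are hypothetical helper names. It also shows why the old expression quoted above, `DATEDIFF(DAY, condition_era_start_date, cohort_start_date)`, gave the opposite sign: it measures from the event to the cohort start, so events before the cohort start come out positive.

```python
from datetime import date


def day_offset(event_date, cohort_start_date):
    """Days from cohort start to the event: negative = before, positive = after."""
    return (event_date - cohort_start_date).days


def in_window(event_date, cohort_start_date, start_day, end_day):
    """True if the event falls within [start_day, end_day] relative to cohort start."""
    return start_day <= day_offset(event_date, cohort_start_date) <= end_day


cohort_start = date(2017, 2, 15)
prior_dx = date(2017, 1, 16)     # 30 days before cohort start
same_day_dx = date(2017, 2, 15)  # day 0

# A window of -365..-1 keeps prior history and excludes day 0.
keeps_prior = in_window(prior_dx, cohort_start, -365, -1)
keeps_day_0 = in_window(same_day_dx, cohort_start, -365, -1)

# The old expression had the arguments the other way around,
# so the same prior event yields a positive number.
old_style_offset = (cohort_start - prior_dx).days
```

With an end day of -1, a diagnosis recorded on the cohort start date itself falls outside the window, which is the behavior wanted for baseline-only covariates.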

@schuemie Thank you for answering my questions. It’s really helpful!

Note: We will release FeatureExtraction 2.0 this week (probably Wednesday). At the same time we will release new versions of PatientLevelPrediction and CohortMethod, as the current versions of those packages are not entirely compatible with the new version of FeatureExtraction.

For those of you who still depend on the current versions of those packages, note that at any time you can install older versions of OHDSI packages through drat. For example, if at some point in the future you would like to install the current version of FeatureExtraction, you can use

library(devtools)
devtools::install_url("https://github.com/OHDSI/drat/raw/gh-pages/src/contrib/FeatureExtraction_1.2.3.tar.gz")

Similarly, you can install the current versions of PatientLevelPrediction or CohortMethod using:

devtools::install_url("https://github.com/OHDSI/drat/raw/gh-pages/src/contrib/CohortMethod_2.4.4.tar.gz")

devtools::install_url("https://github.com/OHDSI/drat/raw/gh-pages/src/contrib/PatientLevelPrediction_1.2.2.tar.gz")

(Looping in @jennareps, @Rijnbeek, @jweave17)

FeatureExtraction V2.0.0 has now been released!

Please see the vignette for details on how to use the new version.

Note that new versions of CohortMethod, CaseControl, and PatientLevelPrediction will also be released today for compatibility with the new version of FeatureExtraction…

Thank you for your contribution @schuemie!
I love the 'createTable' function! I want to modify my hypertension combination package to use this function, too.

Great job!!

@schuemie could you please help us understand how to use createCovariateSettings or a custom covariate builder when there is more than one period per subject_id?

rowIdField: The name of the field in the cohort temp table that is to be used as the row_id field in the output table. This can be especially useful if there is more than one period per person.

@param rowIdField The name of the field in the cohort temp table that is to be used as the row_id field in the output table. This can be especially useful if there is more than one period per person.

[From vignette][1]

getDbLooCovariateData <- function(connection,
                                  oracleTempSchema = NULL,
                                  cdmDatabaseSchema,
                                  cohortTable = "#cohort_person",
                                  cohortId = -1,
                                  cdmVersion = "5",
                                  rowIdField = "subject_id",
                                  covariateSettings,
                                  aggregated = FALSE) {
  writeLines("Constructing length of observation covariates")
  if (covariateSettings$useLengthOfObs == FALSE) {
    return(NULL)
  }
  if (aggregated)
    stop("Aggregation not supported")

  # Some SQL to construct the covariate:
  sql <- paste("SELECT @row_id_field AS row_id, 1 AS covariate_id,",
               "DATEDIFF(DAY, observation_period_start_date, cohort_start_date)",
               "AS covariate_value",
               "FROM @cohort_table c",
               "INNER JOIN @cdm_database_schema.observation_period op",
               "ON op.person_id = c.subject_id",
               "WHERE cohort_start_date >= observation_period_start_date",
               "AND cohort_start_date <= observation_period_end_date",
               "{@cohort_id != -1} ? {AND cohort_definition_id = @cohort_id}")

How can we specify that rowIdField should be the combination of subject_id and cohort_start_date?
[1]: https://github.com/OHDSI/FeatureExtraction/blob/master/vignettes/CreatingCustomCovariateBuilders.Rmd

Hi @Gowtham_Rao. If the subject_id field in your cohort table is not unique (for the cohort of interest), you’ll need to construct a column with values that are unique, and point FeatureExtraction to that column.

The way I usually do this is by using the ROW_NUMBER function in SQL, as you can see in this SQL in CohortMethod. Later, if you want, you can merge the features created by FeatureExtraction with the cohort table, using the row_id column to link back to specific subjects and cohort start dates.

Thank you @schuemie, that sounds like a good solution, especially ROW_NUMBER() OVER (ORDER BY subject_id, cohort_start_date).

Shouldn’t we make the ROW_NUMBER() approach the default? It seems like this is a potential gotcha.

Default how? The cohort table is the input to FeatureExtraction. Whether or not it contains a row_id field (created using ROW_NUMBER) is not within FeatureExtraction’s control.

Thank you @schuemie. It sounds like we have a known problem when there is more than one record per subject_id. The default behavior of FeatureExtraction is to use subject_id as row_id. So, given a situation like this:

We have a problem with the default FeatureExtraction behavior of rowId = ‘subject_id’, because we will have difficulty differentiating between features generated for the same subject_id with different cohort_start_date values (5/1/2016 and 2/15/2017).

In this case, we need a structure something like this, with a new column that uniquely identifies every row within the same cohort_definition_id:

where rowId = ‘cohort_row_id’. Current standard tools don’t do this by default, and the cohort table does not have a cohort_row_id column. So we have to do it outside, by creating a new rowId field using ROW_NUMBER() OVER (PARTITION BY cohort_definition_id ORDER BY subject_id, cohort_start_date).
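A minimal sketch of that approach, run against an in-memory SQLite database from Python so it is self-contained. The column name `cohort_row_id` follows the post above; the table contents are made-up illustration data, and window functions require SQLite 3.25 or later. On a real CDM database the same `ROW_NUMBER()` expression would be written in the platform's own SQL dialect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cohort ("
    " cohort_definition_id INTEGER, subject_id INTEGER,"
    " cohort_start_date TEXT, cohort_end_date TEXT)"
)
# Made-up rows: subject 1 enters cohort 1 twice.
conn.executemany(
    "INSERT INTO cohort VALUES (?, ?, ?, ?)",
    [
        (1, 1, "2016-05-01", "2016-05-10"),
        (1, 1, "2017-02-15", "2017-02-20"),
        (1, 2, "2016-07-04", "2016-07-09"),
    ],
)

# Number the rows within each cohort definition so every entry is unique.
rows = conn.execute(
    """
    SELECT ROW_NUMBER() OVER (
             PARTITION BY cohort_definition_id
             ORDER BY subject_id, cohort_start_date
           ) AS cohort_row_id,
           subject_id,
           cohort_start_date
    FROM cohort
    ORDER BY cohort_row_id
    """
).fetchall()
conn.close()
```

Pointing rowIdField at such a column gives each cohort entry its own row_id, and the extracted features can later be joined back to subject_id and cohort_start_date through this table.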

@schuemie , can you please help me with the logic of creating custom covariates?
Looking at the vignette, I found the following:

cohort_definition_id: A key to link to the cohort table. Note that this will become the covariate
ID, so you should take care that these IDs do not overlap with IDs of other covariate builders that may
be used as well.

I want to assign cohort_definition_id values that do not overlap with the other default covariates I’m going to use. It seems like I can’t just use the cohort_definition_ids from Atlas and need to reassign them as well.
Is there some range of numbers that is not used for default covariate_ids?
I tried to obtain it via reverse engineering, but failed =(
