FeatureExtraction 2.0

Can you please explain the difference between regular and temporal covariates?
As I understand it, for any condition_occurrence we could create several covariate_ids with different time windows (30, 180, 365 days, all time prior, etc.). What has changed now?

What I call ‘temporal covariates’ is simply for when you want lots of time windows (e.g. one for every day in the year prior). In that case, we add a timeId column to the covariate table, and use the same covariate ID across all time windows. There is a class of prediction algorithms for which that makes sense.
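To make the difference concrete, here is a minimal sketch in base R. The data, covariate IDs, and values are made up for illustration and are not output of the package; only the idea of a timeId column comes from the explanation above. Regular covariates need a separate covariate ID per time window, while temporal covariates reuse one covariate ID and let timeId identify the window.

```r
# Illustrative data only; the covariate IDs and values are made up.
# Regular covariates: one covariate ID per (concept, time window) pair.
regular <- data.frame(
  rowId = c(1, 1, 2),
  covariateId = c(201826101, 201826102, 201826101),  # two windows, two IDs
  covariateValue = c(1, 1, 1)
)

# Temporal covariates: the same covariate ID in every window,
# with timeId identifying the window (e.g. one per day).
temporal <- data.frame(
  rowId = c(1, 1, 2),
  timeId = c(1, 300, 42),
  covariateId = c(201826101, 201826101, 201826101),
  covariateValue = c(1, 1, 1)
)

# Number of distinct covariate IDs needed in each representation.
length(unique(regular$covariateId))   # 2
length(unique(temporal$covariateId))  # 1
```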

Now clear, thank you!
‘Adding custom covariates is now easier because one can reference additional SQL’
This is a really interesting idea. This way it will also be possible to identify covariates that turn out to be important and implement them as default ones.

Indeed! The idea is to allow the community to contribute to an ever growing set of available covariates. I think I would still like to define a default set which may not include all, but users will be able to ‘mix and match’ depending on their needs.

@schuemie this is awesome. Have you thought about how we should contribute code and create new analysis_id and covariate_id values without causing conflicts?

Yes, good question.

First, just to be clear: version 2 is still under development, so it is not possible to add analyses at this time. I will announce when we launch version 2.

I would imagine adding a new feature would roughly follow these steps:

  1. Announcing the idea for the new feature on the issue tracker.
  2. If needed, implementing and adding a new template SQL file such as this one. Given the complexity of these files, I would imagine you’d need help with that.
  3. Adding the analysis to the list of prespecified analyses, manually picking a unique analysis ID to prevent collisions.
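As a rough illustration of step 3, the collision check could look like the base R sketch below. The list of existing analysis IDs and both helper functions are hypothetical, not part of the package. The covariate ID pattern (concept ID * 1000 + analysis ID) is the one visible in the temporal-branch SQL quoted further down in this thread, and may not hold for every analysis.

```r
# Hypothetical list of analysis IDs already in use (not the package's real list).
existingAnalysisIds <- c(1, 2, 101, 102, 201)

# TRUE if the candidate analysis ID is not taken yet.
isAnalysisIdFree <- function(candidateId, existingIds) {
  !(candidateId %in% existingIds)
}

# Assumed covariate ID pattern: concept ID * 1000 + analysis ID,
# so the analysis ID has to fit in the last three digits.
makeCovariateId <- function(conceptId, analysisId) {
  stopifnot(analysisId >= 0, analysisId < 1000)
  conceptId * 1000 + analysisId
}

isAnalysisIdFree(101, existingAnalysisIds)  # FALSE: already taken
isAnalysisIdFree(301, existingAnalysisIds)  # TRUE
makeCovariateId(201826, 301)                # 201826301
```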

Note that for features that do not fit the pattern of template SQL + parameters there is always still the option to create custom covariate builders as documented in the package vignettes.

Version 2 is now close to ready for release. I just created a first version of the vignette, and would appreciate any feedback.

@Gowtham_Rao: if you have analyses you would like to add I think it is safe to do so now using the process I described earlier.

Great work @schuemie. This is an important step forward that will
directly support the community’s characterization, estimation, and
prediction efforts.

An important feature here, that was not in the prior version without
creating custom covariates, was actually the subject of discussion in the
Patient-level Prediction tutorial last Friday. The prediction we worked on
as a group was: Amongst T=persons admitted to hospital with pneumonia,
which patients were O=required to have service in the ICU during TAR=the
same inpatient episode? We initially ran this model with all baseline
covariates, but since the prior version used day 0 as part of the covariate
time window, we observed diagnoses that were likely part of the ICU
sequence, rather than prior comorbidities. Having the ability to set the
time windows, as you show in the vignette at the bottom on page 2,
createCovariateSettings, is exactly the solution we’d require; by setting
endDays = -1, we’d limit ourselves to only baseline covariates observed
prior to the hospitalization start.

@schuemie,
Would you like requests posted here, or new git issues created for requests?

Please post requests in the GitHub issue tracker.

@schuemie - will get on this ASAP!

This is awesome. And thank you!

Thanks for the wonderful job, @schuemie.
I have three questions.

  1. Can you tell me roughly when this cool version of the FeatureExtraction package will be released? My current work depends heavily on this package (especially for temporal features).

  2. Recently, we’ve developed a genetic CDM compatible with the existing OMOP CDM (three tables for genetic information were added in our model). We’re now working on building prediction algorithms that use combined clinical and genetic data in the CDM. I’m thinking about using the FeatureExtraction package with a custom covariate builder to extract genetic information from these newly added tables and columns. Is it possible to extract information from user-defined tables in the CDM using FeatureExtraction version 2?

  3. In the previous temporal feature branch, I found that the package extracted information after cohort_start_date when I used negative start/end days, whereas the expected behavior was to extract features before cohort_start_date (I had to use positive start/end days because of this issue). Is this problem fixed in the new version?

The example code from the temporal feature branch is below:

SQL code
-- condition era starts in time period
SELECT DISTINCT cp1.@row_id_field AS row_id,
CAST(ce1.condition_concept_id AS BIGINT) * 1000 + 101 AS covariate_id,
tp1.time_id AS time_id,
1 AS covariate_value
INTO #cov_co_start
FROM @cohort_temp_table cp1
INNER JOIN @cdm_database_schema.condition_era ce1
ON cp1.subject_id = ce1.person_id
INNER JOIN #time_period tp1
ON DATEDIFF(DAY, ce1.condition_era_start_date, cp1.cohort_start_date) <= tp1.end_day
AND DATEDIFF(DAY, ce1.condition_era_start_date, cp1.cohort_start_date) >= tp1.start_day
WHERE ce1.condition_concept_id != 0
{@has_excluded_covariate_concept_ids} ? { AND ce1.condition_concept_id NOT IN (SELECT concept_id FROM #excluded_cov)}
{@has_included_covariate_concept_ids} ? { AND ce1.condition_concept_id IN (SELECT concept_id FROM #included_cov)}
;

Thank you for all your contribution!
Chan

Hi @SCYou,

To answer your questions:

  1. We are in the process of finalizing the package and dependencies (CohortMethod and PatientLevelPrediction). I think we should be ready in the next two weeks, at which time we will cut over all three packages at once.

  2. FeatureExtraction is designed to work against the CDM. So the cleanest path forward would be to get your additions for genetic data in the CDM specifications, which you will have to discuss on the CDM issue tracker. Alternatively, you could write a custom covariate builder for now.

  3. Yes, for consistency all ‘days’ arguments are now relative to the cohort_start_date, so negative means going back in time and positive means going forward in time.
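The sign convention in point 3 can be sketched in base R as follows. The inWindow helper and the dates are made up for illustration and are not part of FeatureExtraction; the sketch only shows how offsets relative to cohort_start_date behave, including why an end day of -1 excludes events on the cohort start date itself.

```r
# Returns TRUE when eventDate falls in the window [startDay, endDay],
# with both offsets counted in days relative to cohortStartDate
# (negative = before cohort start, positive = after).
inWindow <- function(eventDate, cohortStartDate, startDay, endDay) {
  offset <- as.numeric(as.Date(eventDate) - as.Date(cohortStartDate))
  offset >= startDay & offset <= endDay
}

cohortStart <- "2017-06-15"
events <- c("2017-05-20", "2017-06-15", "2017-06-20")

# Look back 30 days, up to and including the day before cohort start.
inWindow(events, cohortStart, startDay = -30, endDay = -1)  # TRUE FALSE FALSE

# Including day 0 also picks up events on the cohort start date itself.
inWindow(events, cohortStart, startDay = -30, endDay = 0)   # TRUE TRUE FALSE
```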

@schuemie Thank you for answering my questions. It’s really helpful!

Note: We will release FeatureExtraction 2.0 this week (probably Wednesday). At the same time we will release new versions of PatientLevelPrediction and CohortMethod, as the current versions of those packages are not entirely compatible with the new version of FeatureExtraction.

For those of you who still depend on the current versions of those packages, note that at any time you can install older versions of OHDSI packages through drat. For example, if sometime in the future you would like to install the current version of FeatureExtraction, you can use:

library(devtools)
devtools::install_url("https://github.com/OHDSI/drat/raw/gh-pages/src/contrib/FeatureExtraction_1.2.3.tar.gz")

Similarly, you can install the current versions of PatientLevelPrediction or CohortMethod using:

devtools::install_url("https://github.com/OHDSI/drat/raw/gh-pages/src/contrib/CohortMethod_2.4.4.tar.gz")

devtools::install_url("https://github.com/OHDSI/drat/raw/gh-pages/src/contrib/PatientLevelPrediction_1.2.2.tar.gz")

(Looping in @jennareps, @Rijnbeek, @jweave17)

FeatureExtraction V2.0.0 has now been released!

Please see the vignette for details on how to use the new version.

Note that new versions of CohortMethod, CaseControl, and PatientLevelPrediction will also be released today for compatibility with the new version of FeatureExtraction…

Thank you for your contribution @schuemie!
I love the ‘createTable’ function!! I want to modify my hypertension combination package to use this function, too.

Great job!!

@schuemie could you please help us understand how to use createCovariateSettings or a custom covariate builder when there is more than one period per subject_id?

rowIdField: The name of the field in the cohort temp table that is to be used as the row_id field in the output table. This can be especially useful if there is more than one period per person.

@param rowIdField The name of the field in the cohort temp table that is to be used as the row_id field in the output table. This can be especially useful if there is more than one period per person.

[From vignette][1]

getDbLooCovariateData <- function(connection,
                                  oracleTempSchema = NULL,
                                  cdmDatabaseSchema,
                                  cohortTable = "#cohort_person",
                                  cohortId = -1,
                                  cdmVersion = "5",
                                  rowIdField = "subject_id",
                                  covariateSettings,
                                  aggregated = FALSE) {
  writeLines("Constructing length of observation covariates")
  if (covariateSettings$useLengthOfObs == FALSE) {
    return(NULL)
  }
  if (aggregated)
    stop("Aggregation not supported")

  # Some SQL to construct the covariate:
  sql <- paste("SELECT @row_id_field AS row_id, 1 AS covariate_id,",
               "DATEDIFF(DAY, observation_period_start_date, cohort_start_date)",
               "AS covariate_value",
               "FROM @cohort_table c",
               "INNER JOIN @cdm_database_schema.observation_period op",
               "ON op.person_id = c.subject_id",
               "WHERE cohort_start_date >= observation_period_start_date",
               "AND cohort_start_date <= observation_period_end_date",
               "{@cohort_id != -1} ? {AND cohort_definition_id = @cohort_id}")

How can we specify that rowIdField is the combination of subject_id and cohort_start_date?
[1]: https://github.com/OHDSI/FeatureExtraction/blob/master/vignettes/CreatingCustomCovariateBuilders.Rmd

Hi @Gowtham_Rao. If the subject_id field in your cohort table is not unique (for the cohort of interest), you’ll need to construct a column whose values are unique, and point FeatureExtraction to that column.

The way I usually do this is by using the ROW_NUMBER function in SQL, as you can see in this SQL in CohortMethod. Later, if you want, you can merge the features created by FeatureExtraction with the cohort table on the row_id column to link back to specific subjects and cohort start dates.
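The same idea can be sketched in base R as a stand-in for the SQL. The cohort and covariate data below are made up, and the column names rowId, covariateId, and covariateValue are an assumption about how the covariates come back; the point is only that a unique row_id per cohort entry lets you link features back to a subject and cohort start date.

```r
# Made-up cohort table in which subject 1 has two cohort entries.
cohort <- data.frame(
  subject_id = c(2, 1, 1),
  cohort_start_date = as.Date(c("2016-03-01", "2017-01-10", "2015-07-04"))
)

# Base R equivalent of ROW_NUMBER() OVER (ORDER BY subject_id, cohort_start_date).
cohort <- cohort[order(cohort$subject_id, cohort$cohort_start_date), ]
cohort$row_id <- seq_len(nrow(cohort))

# Made-up covariates keyed on the row ID (assumed column names).
covariates <- data.frame(
  rowId = c(1, 2, 2, 3),
  covariateId = c(1, 1, 2, 1),
  covariateValue = c(100, 650, 1, 30)
)

# Link each covariate back to its subject and cohort start date.
linked <- merge(covariates, cohort, by.x = "rowId", by.y = "row_id")
linked
```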

Thank you @schuemie, that sounds like a good solution, especially ROW_NUMBER() OVER (ORDER BY subject_id, cohort_start_date).

Shouldn’t we make the row_number() default? It seems like this is a potential gotcha situation?
