FeatureExtraction: when should redundant variables be removed? (and normalization?)

schuemie · May 4, 2017, 7:38am

Currently the FeatureExtraction package performs two steps right after fetching covariate data from the server:

1. Remove redundant covariates, since these will hamper convergence of most regression methods. Redundant covariates come in two flavors:

   - Covariates that have the same value for every subject. For example, if I nest my analysis in a group of people with prior diabetes, probably everyone will have the value 1 for the covariate encoding prior diabetes.
   - A group of covariates that are mutually exclusive but together cover everyone. For example, everyone belongs to exactly 1 age group. That means that 1 of the age groups is redundant. Right now, FeatureExtraction removes the most prevalent age group.

2. Normalization of values, since variables can be at different scales, which will lead to problems for example during regularization. Each value is divided by the max value for that covariate.
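
To make the two steps concrete, here is a minimal Python sketch of the first redundancy flavor (constant covariates) and of max-normalization with the factors kept so the original scale can be restored. This is not the FeatureExtraction code; the sparse dict layout, the function names, and the covariate names are all assumptions made for illustration.

```python
def find_constant_covariates(covariates, n_subjects):
    """Return IDs of covariates that have the same value for every subject.

    `covariates` maps covariate ID -> {subject ID: value}. The storage is
    sparse: a subject missing from the inner dict implicitly has value 0.
    """
    constant = []
    for cov_id, values in covariates.items():
        distinct = set(values.values())
        if len(values) < n_subjects:
            distinct.add(0)  # at least one subject has an implicit zero
        if len(distinct) <= 1:
            constant.append(cov_id)
    return constant


def normalize_by_max(covariates):
    """Divide each value by its covariate's max; also return the factors."""
    normalized, max_values = {}, {}
    for cov_id, values in covariates.items():
        max_value = max(values.values(), default=0)
        if max_value <= 0:
            # Nothing sensible to scale by; leave this covariate untouched.
            normalized[cov_id] = dict(values)
            continue
        max_values[cov_id] = max_value
        normalized[cov_id] = {s: v / max_value for s, v in values.items()}
    return normalized, max_values


def denormalize(normalized, max_values):
    """Restore the original scale from the stored normalization factors."""
    return {
        cov_id: {s: v * max_values.get(cov_id, 1) for s, v in values.items()}
        for cov_id, values in normalized.items()
    }


# Toy data (hypothetical): 3 subjects, all of whom have prior diabetes.
covariates = {
    "prior_diabetes": {1: 1, 2: 1, 3: 1},
    "charlson_index": {1: 2, 2: 4},  # subject 3 implicitly has 0
}
redundant = find_constant_covariates(covariates, n_subjects=3)
kept = {c: v for c, v in covariates.items() if c not in redundant}
normalized, max_values = normalize_by_max(kept)
```

Because the max values are returned alongside the normalized data, `denormalize(normalized, max_values)` gives back the values in `kept`. The removed constant covariate, however, cannot be recovered from the output alone.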

For fitting models like in the PatientLevelPrediction package or the CohortMethod package this works fine. However, for other purposes this poses problems.

Specifically, when computing covariate balance or when characterizing the population in general, we would like to see all covariates, not just the non-redundant ones, and we would like to see values at their original scale. Currently, the computeCovariateBalance function temporarily undoes the normalization, but the redundant covariates are gone, as @Ajit_Londhe pointed out here.

I see these options for moving forward:

1. Don’t remove redundancy and perform normalization when creating covariates. Instead, move these steps to the point in time when the model is fitted.
2. Keep things as they are, but add information to the covariate object that will allow reconstructing the redundant covariates later (similar to what we do for normalization, where the normalization factors are stored).
3. Don’t remove redundant variables when creating covariates, but flag them for deletion so they can be deleted when fitting the model.

The downside of option 1 is that removing redundant covariates sometimes requires specific knowledge of the covariates (for example, that the age groups together form a redundant block), and this does not fit well with our notion of custom and exchangeable covariate builders. Option 2 is probably the most efficient, and option 3 the most elegant.
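
As a rough illustration of option 3, the Python sketch below keeps every covariate and only marks the redundant ones, so that characterization sees everything while model fitting filters on the flag. The `CovariateRef` class, its `redundant` field, and the covariate IDs are hypothetical and not part of FeatureExtraction.

```python
from dataclasses import dataclass


@dataclass
class CovariateRef:
    covariate_id: int
    name: str
    redundant: bool = False  # hypothetical flag set by the covariate builder


def covariates_for_characterization(refs):
    """Balance tables and population summaries see every covariate."""
    return list(refs)


def covariates_for_model(refs):
    """Model fitting drops whatever the covariate builder flagged."""
    return [ref for ref in refs if not ref.redundant]


# Hypothetical age-group covariates; the builder knows they form one block
# and flags the most prevalent group instead of deleting it.
refs = [
    CovariateRef(1, "age group 40-44"),
    CovariateRef(2, "age group 45-49", redundant=True),
    CovariateRef(3, "age group 50-54"),
]
model_refs = covariates_for_model(refs)
```

The point of the design is that the knowledge about the redundant block stays with the covariate builder, while the decision to actually drop the covariate is deferred to the consumer.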

What does everyone think?

Looping in @jennareps, @Rijnbeek, @msuchard, @jweave17, and @Patrick_Ryan.

Rijnbeek · May 4, 2017, 6:44pm

If you do apply some knowledge when deleting redundant variables early in the process, I would go for option 3, so that you can carry that knowledge forward in the pipeline.

We do store the max of the covariates somewhere, right? This could have an impact when transporting the model.


schuemie · May 5, 2017, 12:03pm

Hi Peter,

Yes, the max values are stored in the covariate metadata.

Indeed, another important thing to consider is the portability of covariate construction. A tricky thing here is that a variable that is not redundant in one dataset may be redundant in another.

schuemie · May 8, 2017, 2:09pm

If we go for option 3, should we also postpone normalization until right before model fitting?

schuemie · July 6, 2017, 9:26am

I think I’ve figured out how to do option 1 (postponing the removal of redundant covariates until right before fitting a model), as implemented here. Rather than relying on specific knowledge of the covariates, it simply uses the analysis IDs to spot groups of covariates that, combined, are redundant.
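
For readers who want the idea in code, here is a minimal Python sketch of what grouping by analysis ID could look like. It is not the linked implementation: the data layout, the function name, and the covariate and analysis IDs are assumptions, and it only handles binary covariates. It follows the earlier rule of removing the most prevalent covariate in a redundant group.

```python
from collections import Counter, defaultdict


def find_group_redundancy(covariates, analysis_ids, n_subjects):
    """Return {analysis ID: covariate ID to drop} for redundant groups.

    `covariates` maps covariate ID -> set of subject IDs with value 1.
    `analysis_ids` maps covariate ID -> ID of the analysis that created it.
    A group is redundant when its covariates are mutually exclusive and
    together cover every subject.
    """
    by_analysis = defaultdict(list)
    for cov_id in covariates:
        by_analysis[analysis_ids[cov_id]].append(cov_id)

    to_remove = {}
    for analysis_id, cov_ids in by_analysis.items():
        if len(cov_ids) < 2:
            continue  # a lone covariate is a matter for the constant check
        # Count in how many covariates of this group each subject appears.
        hits = Counter()
        for cov_id in cov_ids:
            hits.update(covariates[cov_id])
        covers_everyone = len(hits) == n_subjects
        mutually_exclusive = all(count == 1 for count in hits.values())
        if covers_everyone and mutually_exclusive:
            # Drop the most prevalent covariate of the group.
            to_remove[analysis_id] = max(
                cov_ids, key=lambda c: len(covariates[c])
            )
    return to_remove


# Hypothetical data: 6 subjects, three age groups created by analysis 3,
# and one gender covariate created by analysis 1.
covariates = {
    1003: {1, 2},
    2003: {3},
    3003: {4, 5, 6},
    8507001: {1, 3, 5},
}
analysis_ids = {1003: 3, 2003: 3, 3003: 3, 8507001: 1}
to_remove = find_group_redundancy(covariates, analysis_ids, n_subjects=6)
```

Here only the age-group analysis is flagged, and the covariate chosen for removal is the largest group. Because the check runs on the data at hand, a group that is redundant in one dataset but not in another is handled without any covariate-specific knowledge.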

So the new proposal is to not do normalization or removal of covariates up front, but only at the time when needed.