Currently the FeatureExtraction package performs two steps right after fetching covariate data from the server:
- Remove redundant covariates, since these will hamper convergence of most regression methods. Redundant covariates come in two flavors:
  - Covariates that have the same value for every subject. For example, if I nest my analysis in a group of people with prior diabetes, probably everyone will have the value 1 for the covariate encoding prior diabetes.
  - A group of covariates that are mutually exclusive but together cover everyone. For example, everyone belongs to exactly 1 age group. That means that 1 of the age groups is redundant. Right now, FeatureExtraction removes the most prevalent age group.
- Normalization of values, since variables can be at different scales, which will lead to problems for example during regularization. Each value is divided by the max value for that covariate.
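To make the two steps concrete, here is a minimal sketch in Python. It is not the FeatureExtraction code: the function name `tidy_covariates`, the sparse `(subject_id, covariate_id, value)` layout, and the `exclusive_groups` argument are all assumptions for illustration, and values are assumed to be positive.

```python
from collections import defaultdict


def tidy_covariates(rows, n_subjects, exclusive_groups):
    """Illustrative only; names and data layout are hypothetical.

    rows: sparse (subject_id, covariate_id, value) triples, missing means 0.
    exclusive_groups: lists of covariate ids assumed (from domain knowledge)
    to be mutually exclusive and to cover everyone, e.g. the age groups.
    """
    subjects = defaultdict(set)   # covariate -> subjects with a non-zero value
    values = defaultdict(set)     # covariate -> distinct observed values
    max_value = {}                # covariate -> normalization factor
    for subject_id, covariate_id, value in rows:
        subjects[covariate_id].add(subject_id)
        values[covariate_id].add(value)
        max_value[covariate_id] = max(max_value.get(covariate_id, value), value)

    redundant = set()
    # Flavor 1: the same value for every subject.
    for covariate_id, seen in subjects.items():
        if len(seen) == n_subjects and len(values[covariate_id]) == 1:
            redundant.add(covariate_id)
    # Flavor 2: drop the most prevalent member of each exclusive group.
    for group in exclusive_groups:
        present = [c for c in group if c in subjects]
        if present:
            redundant.add(max(present, key=lambda c: len(subjects[c])))

    # Normalization: divide each remaining value by that covariate's maximum.
    kept = [(s, c, v / max_value[c]) for s, c, v in rows if c not in redundant]
    factors = {c: m for c, m in max_value.items() if c not in redundant}
    return kept, redundant, factors


rows = [
    (1, "prior_diabetes", 1), (2, "prior_diabetes", 1), (3, "prior_diabetes", 1),
    (1, "age_20_29", 1), (2, "age_30_39", 1), (3, "age_30_39", 1),
    (1, "bmi", 20.0), (2, "bmi", 40.0), (3, "bmi", 30.0),
]
kept, redundant, factors = tidy_covariates(rows, 3, [["age_20_29", "age_30_39"]])
```

Note that the second flavor cannot be detected from the data alone in general, which is why the sketch takes the groups as an explicit argument.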
For fitting models like in the PatientLevelPrediction package or the CohortMethod package this works fine. However, for other purposes this poses problems.
Specifically, when computing covariate balance or when characterizing the population in general we would like to see all covariates, not just the non-redundant ones, and we would like to see values at their original scale. Currently, the computeCovariateBalance function temporarily removes normalization, but the redundant variables are gone, as @Ajit_Londhe pointed out here.
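Undoing normalization is cheap as long as the factors were stored. A rough sketch of the idea (the helper `restore_scale` and the `norm_factors` dictionary are hypothetical names, not what computeCovariateBalance actually calls):

```python
def restore_scale(rows, norm_factors):
    """Multiply each normalized value by the stored maximum for its covariate.

    rows: (subject_id, covariate_id, normalized_value) triples.
    norm_factors: covariate_id -> max value used when normalizing (assumed stored).
    """
    return [(s, c, v * norm_factors[c]) for s, c, v in rows]


original = restore_scale([(1, "bmi", 0.5), (2, "bmi", 1.0)], {"bmi": 40.0})
```

Nothing comparable exists for the removed covariates: once the rows are dropped, there is no stored information to bring them back.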
I see these options for moving forward:
1. Perform neither redundancy removal nor normalization when creating covariates. Instead, move both steps to the point in time when the model is fitted.
2. Keep things as they are, but add information to the covariate object that will allow reconstructing the redundant covariates later (similar to what we do for normalization, where the normalization factors are stored).
3. Don’t remove redundant variables when creating covariates, but flag them for deletion so they can be deleted when fitting the model.
The downside of option 1 is that removing redundant covariates sometimes requires specific knowledge of the covariates (for example, that age groups together form a redundant block), and this does not fit well with our notion of custom and exchangeable covariate builders. Option 2 is probably the most efficient, and option 3 the most elegant.
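To illustrate what I mean by flagging in option 3, here is a small sketch. Everything in it is hypothetical (the `CovariateRef` class, the `redundant` field, and both helper functions); the point is only that the covariate builder, which has the domain knowledge, sets the flag, and the consumer decides whether to act on it.

```python
from dataclasses import dataclass


@dataclass
class CovariateRef:
    covariate_id: str
    name: str
    redundant: bool = False  # hypothetical flag set by the covariate builder


def covariates_for_model(rows, refs):
    """Model fitting: drop rows belonging to covariates flagged as redundant."""
    drop = {ref.covariate_id for ref in refs if ref.redundant}
    return [row for row in rows if row[1] not in drop]


def covariates_for_characterization(rows, refs):
    """Balance and characterization: keep everything, flags are ignored."""
    return list(rows)


refs = [
    CovariateRef("age_20_29", "Age group 20-29"),
    CovariateRef("age_30_39", "Age group 30-39", redundant=True),
]
rows = [(1, "age_20_29", 1), (2, "age_30_39", 1), (3, "age_30_39", 1)]
model_rows = covariates_for_model(rows, refs)
all_rows = covariates_for_characterization(rows, refs)
```

With this layout custom covariate builders stay exchangeable, since each one only has to mark its own redundant covariates.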
What does everyone think?
Looping in @jennareps, @Rijnbeek, @msuchard, @jweave17, and @Patrick_Ryan.