
One-to-many mapping causing collinearity

When reviewing the output of a Lasso prediction model (PLP), we found that it included, among others, pairs of covariates that have the same values for every person (meaning the two variables are technically identical).
After investigating those covariates, we found that this was caused by one-to-many concept mapping of a single source variable. In other words, when one source variable is mapped to more than one (typically two) standard concepts, and neither is excluded from the baseline covariates, the PlpData stores two covariates with exactly the same information. Since regularization is applied during training, we expected the model to push one of the coefficients towards zero. However, after the PLP run completed, the model contained both covariates with different coefficients, neither of which was pushed to zero.
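
For illustration, here is a minimal sketch of the setup outside of PLP, using glmnet rather than the Cyclops fitter that PLP uses internally (so the solver behaviour may well differ); the duplicated "stoma" column stands in for the two identical covariates produced by the one-to-many mapping:

```r
# Minimal sketch (assumed setup, not our actual pipeline): one source
# variable mapped to two identical standard-concept covariates.
library(glmnet)

set.seed(42)
n <- 500
stoma <- rbinom(n, 1, 0.3)          # single source variable
x <- cbind(ileostomy = stoma,       # mapped to two identical covariates
           colostomy = stoma,
           age       = rnorm(n, 60, 10))
y <- rbinom(n, 1, plogis(-1 + 1.5 * stoma))

fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)  # alpha = 1 -> LASSO
coef(fit, s = "lambda.min")  # which duplicate (if either) is zeroed
                             # depends on the solver
```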

We were wondering whether there are any guidelines or experience on how one-to-many concept mappings should be handled in PLP models, and whether any special considerations are necessary.

@cssdenmark @Karoline_Bendix_CSS

Thank you

Best, Julie and Eldar


My two cents:

  1. In real data, collinearity is unavoidable (unless you deliberately remove it, for example by reducing dimensionality with PCA, which I think is an interesting area of research for increasing the interpretability of the model, but not necessarily for improving predictive performance).

  2. In my experience, regularized regressions are very capable of handling the collinearity we encounter. I have never observed the need to remove it.

That said, I am surprised the LASSO didn’t shrink one of the two coefficients to 0 if the variables have exactly the same values.

Thank you for your reply @schuemie.

Related to 1): for now our main focus is the predictive power of the model, but it would definitely be interesting to look into interpretability at a later point.

Related to 2): here is the case where Lasso was not able to shrink one of the duplicated concepts to zero.

We mapped “Temporary stoma” from the Danish register (an expanded version of ICD-10) to

| Concept ID | Concept name |
|---|---|
| 4017329 | Temporary ileostomy |
| 4297515 | Temporary colostomy |

and
“Permanent stoma” to

| Concept ID | Concept name |
|---|---|
| 4224467 | Permanent colostomy |
| 4279534 | Permanent ileostomy |

in a one-to-many mapping.
When investigating the “duplicated concepts” in the PLP model output in PLPViewer, we found that both covariates of each pair received different non-zero coefficients.

The model thereby contained multiple variables that were 100 % correlated.
We later realized that this is medically an incorrect mapping, since a patient who had a temporary stoma had either a Temporary ileostomy or a Temporary colostomy, not both. We have therefore corrected this in our CDM.

That said, LASSO was still not able to push one of the duplicated coefficients to zero. This definitely sparked our curiosity about how well LASSO handles collinearity in situations where a large number of one-to-many mappings are present, originating from only a few source values.

Interesting question. With identical predictors you have non-uniqueness of the set of coefficients that minimize the loss function, just as you would with ordinary least squares (OLS) and identical predictors. In fact, I don't see any reason why LASSO would force one of the coefficients to zero rather than the other, or favor any one combination of weights on the two parameters whose sum is constant (assuming both have the same sign), because all such combinations yield the same value for the loss function: their contribution is identical in both the squared-error and the penalty terms.

With OLS and independent predictors there is a unique analytical solution involving a matrix inverse, which does not exist with identical predictors; you get an error or NA when that occurs. (Even when the solution is found numerically, it relies implicitly on the matrix inverse.) For LASSO there is no analytical solution, and my guess is that what is happening, i.e., how one particular combination of coefficients is chosen, comes down to the method of solution and the way the algorithm works.
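
To make that concrete: with identical columns $x_{i1} = x_{i2} = x_i$ (and, for simplicity, a squared-error loss rather than the logistic likelihood PLP actually fits, though the argument is the same), the LASSO objective is

$$
L(\beta_1, \beta_2) = \frac{1}{2} \sum_{i=1}^{n} \left( y_i - x_i (\beta_1 + \beta_2) \right)^2 + \lambda \left( |\beta_1| + |\beta_2| \right).
$$

Whenever $\beta_1$ and $\beta_2$ have the same sign, $|\beta_1| + |\beta_2| = |\beta_1 + \beta_2|$, so $L$ depends on the pair only through the sum $s = \beta_1 + \beta_2$: the splits $(s, 0)$, $(0, s)$ and $(s/2, s/2)$ all give exactly the same loss, and the minimizer is not unique.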


That’s unfortunately not quite right, @Julwa. A couple of problems:

  • Ileostomy and colostomy are Procedures for putting those stomata in place, not the anatomical structures themselves.
  • You can only do a one-to-many mapping if both target concepts are true, not just one of them. In this case, the procedure was either an ileostomy OR a colostomy. Such OR-combinations must not be split.

I don’t think SNOMED has such a concept. Essentially, your temporary stoma is a combination of the attributes {timing = temporary} (or permanent), {anatomical site = ileo- OR colo-} and {artificial opening = stoma}. Such a concept would have to be precoordinated as a SNOMED Extension and placed in the hierarchy.

I know this is a lot to ask. We need to come up with a methodology to do that.


@Christian_Reich, thanks for pointing this out. We will definitely follow the discussion on the SNOMED Extension.

On that note, @schuemie, in reference to the previous posts regarding the removal of redundant variables (FeatureExtraction: when should redundant variables be removed? (and normalization?)):
could it be of interest in the future to have an automated check/warning when two covariates are completely identical, or to remove one of them automatically, since one of the two variables is redundant?
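
For example, a hypothetical helper along these lines could flag exact duplicates (a sketch for dense matrices only; real PlpData covariate matrices are sparse, so a production version would need to hash columns rather than transpose):

```r
# Hypothetical helper (not part of FeatureExtraction): flag covariate
# columns that are exactly identical to an earlier column.
findDuplicateCovariates <- function(x) {
  dup <- duplicated(t(x))  # duplicated() compares rows, so transpose first
  colnames(x)[dup]
}

x <- cbind(tempIleostomy = c(1, 0, 1, 0),
           tempColostomy = c(1, 0, 1, 0),  # identical to tempIleostomy
           age           = c(55, 63, 47, 70))
findDuplicateCovariates(x)  # -> "tempColostomy"
```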
