
Adding Polynomial Features != Multicollinearity

The term ‘multicollinearity’ was first introduced by Ragnar Frisch (1934), Nobel Laureate in Economics: “I believe that a substantial part of the regression and correlation analysis which have been made on statistical data in recent years is nonsense … If the statistician does not dispose of an adequate technique for the statistical study of the confluence hierarchy, he will run the risk of adding more and more variates in the study until he gets a set that is in fact multiple collinear and where his attempt to determine a regression equation is therefore absurd.”

Multicollinearity is a pain for inferential modeling, where accurate estimation of coefficients is critical. At the same time, introducing polynomial terms can make sense when the relationship between the target variable and the feature of interest really is polynomial. That raises the question: “wouldn’t adding polynomial features naturally introduce multicollinearity?” In this article, I address that question with a practical walkthrough: adding polynomial terms and then testing for multicollinearity.

So here is an actual example: y = 0.1*X + 0.4*X² + 0.6*X³

Where

import random
import pandas as pd

# Simulate a single feature and its polynomial terms
X1 = [random.randint(0, 1000) for i in range(1000)]
df = pd.DataFrame()
df['X1'] = X1
df['X2'] = df['X1'] ** 2
df['X3'] = df['X1'] ** 3
# Target follows the relationship stated above: y = 0.1*X + 0.4*X² + 0.6*X³
df['y'] = 0.1*df['X1'] + 0.4*df['X2'] + 0.6*df['X3']
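
With the simulated data in hand, we can test whether X1, X2, and X3 are in fact collinear. One standard diagnostic is the variance inflation factor (VIF) from statsmodels; the sketch below is my own illustration of such a test, and the ~5–10 cutoff is a common rule of thumb rather than anything specific to this walkthrough.

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Include an intercept column, as in an ordinary regression design matrix
X = sm.add_constant(df[['X1', 'X2', 'X3']])

# VIF for each column; values above roughly 5-10 are commonly taken
# to signal problematic multicollinearity
for i, name in enumerate(X.columns):
    print(name, variance_inflation_factor(X.values, i))

VIF measures how well each column is explained by all the others, which makes it more informative here than a simple pairwise correlation matrix.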



Written by Emad Ezzeldin, Sr. Data Scientist @ UnitedHealthGroup

Data Scientist with 5 years of experience and an MSc in Data Analytics from George Mason University. I enjoy experimenting with data science tools. emad.ezzeldin4@gmail.com
