Adding Polynomial Features != Multicollinearity
The term ‘multicollinearity’ was first introduced by Ragnar Frisch (1934), Nobel Laureate in Economics, who wrote: “I believe that a substantial part of the regression and correlation analysis which have been made on statistical data in recent years is nonsense … If the statistician does not dispose of an adequate technique for the statistical study of the confluence hierarchy, he will run the risk of adding more and more variates in the study until he gets a set that is in fact multiple collinear and where his attempt to determine a regression equation is therefore absurd.”
Multicollinearity is a pain for inferential modeling, where accuracy in the estimation of coefficients is critical. At the same time, introducing polynomial terms can make sense if the relationship between the target variable and the feature of interest really is polynomial. That raises the question: wouldn’t adding polynomial features naturally introduce multicollinearity? In this article, I address that question with a practical walkthrough of adding polynomial terms and then testing for multicollinearity.
So here is an actual example of y = 0.1*X + 0.4*X² + 0.6*X³,
where the data is generated as follows:
# Simulate a feature and its polynomial terms
import random
import pandas as pd

X1 = [random.randint(0, 1000) for i in range(1000)]

df = pd.DataFrame()
df['X1'] = X1
df['X2'] = df['X1'] ** 2   # squared term
df['X3'] = df['X1'] ** 3   # cubic term
df['y'] = 0.1*df['X1'] + 0.4*df['X2'] + 0.6*df['X3']
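From here, one way the “testing for multicollinearity” step could look is a variance inflation factor (VIF) check. The sketch below uses statsmodels for that check; the choice of statsmodels (and the common VIF > 5–10 rule of thumb for flagging inflated variances) is my assumption here, not something prescribed by the simulated example itself.

# Minimal VIF check on the polynomial terms (continues from the df built above)
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Add an intercept column so each VIF is computed against a full design matrix
X = sm.add_constant(df[['X1', 'X2', 'X3']])

# One VIF per polynomial term (index 0 is the constant, so it is skipped)
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=['X1', 'X2', 'X3']
)
print(vif)

The output is simply one VIF per term; how those numbers should be read in the context of polynomial features is exactly what the rest of the walkthrough examines.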