
Adding Polynomial Features != Multicollinearity

The term ‘multicollinearity’ was first introduced by Ragnar Frisch (1934), Nobel Laureate in Economics: “I believe that a substantial part of the regression and correlation analysis which have been made on statistical data in recent years is nonsense … If the statistician does not dispose of an adequate technique for the statistical study of the confluence hierarchy, he will run the risk of adding more and more variates in the study until he gets a set that is in fact multiple collinear and where his attempt to determine a regression equation is therefore absurd.”

Multicollinearity is a pain for inferential modeling, where accurate estimation of coefficients is critical. At the same time, introducing polynomial terms can make sense when the relationship between the target variable and the feature of interest really is polynomial. That raises the question: “wouldn’t adding polynomial features naturally introduce multicollinearity?” In this article, I address that question with a practical walkthrough: adding polynomial terms and then testing for multicollinearity.

So here is an actual example: y = 0.1*X + 0.4*X² + 0.6*X³

Where

import random
import pandas as pd

# Simulate a single feature and its polynomial terms
X1 = [random.randint(0, 1000) for i in range(1000)]
df = pd.DataFrame()
df['X1'] = X1
df['X2'] = df['X1'] ** 2
df['X3'] = df['X1'] ** 3
# Target follows the relationship stated above: y = 0.1*X + 0.4*X² + 0.6*X³
df['y'] = 0.1*df['X1'] + 0.4*df['X2'] + 0.6*df['X3']
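
With the simulated data in hand, we can test whether X1, X2, and X3 are in fact collinear. One standard diagnostic is the variance inflation factor (VIF) from statsmodels; the sketch below is my own illustration of such a test, and the ~5–10 cutoff is a common rule of thumb rather than anything specific to this walkthrough.

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Include an intercept column, as in an ordinary regression design matrix
X = sm.add_constant(df[['X1', 'X2', 'X3']])

# VIF for each column; values above roughly 5-10 are commonly taken
# to signal problematic multicollinearity
for i, name in enumerate(X.columns):
    print(name, variance_inflation_factor(X.values, i))

VIF measures how well each column is explained by all the others, which makes it more informative here than a simple pairwise correlation matrix.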



Written by Emad Ezzeldin, Sr. Data Scientist @ UnitedHealthGroup

Data Scientist with 5 years of experience and an MSc in Data Analytics from George Mason University. I enjoy experimenting with data science tools. emad.ezzeldin4@gmail.com
