Can Machine Learning Algorithms Process Contextual Features for Regression?

--

1- Introduction: The Question

Many real-world ML problems include “Contextual Features”: features that can only be treated as features when flagged as important by a “Context Flag”. Furthermore, most ML feature sets are interactive in some capacity (multiplication, division, powers, etc.). Take Figure 1, which shows point interpolation: point L0 is interpolated using points L1 and L2 and the distances L11, L12, L21, and L22.

Figure 1. Point Interpolation using two lag and lead points
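
For concreteness, a minimal sketch of such an interpolation is below, assuming inverse-distance weighting (the exact scheme is not pinned down here, and the function name is hypothetical):

# Hypothetical sketch: interpolating L0 from its lag/lead neighbours,
# assuming inverse-distance weighting (the exact scheme is an assumption).
def interpolate_l0(l1_lag, l1_lead, l2_lag, l2_lead, d11, d12, d21, d22):
    points = [l1_lag, l1_lead, l2_lag, l2_lead]
    distances = [d11, d12, d21, d22]        # each distance belongs to one specific point
    weights = [1.0 / d for d in distances]  # closer points contribute more
    return sum(w * p for w, p in zip(weights, points)) / sum(weights)

Note that the weight 1/d is exactly the kind of division interaction mentioned above: a distance only means something in combination with the point it belongs to.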

Here is what a machine learning experiment dataframe would look like:

Table 1. ML Dataframe for interpolation problem.

L1_Lag_Point: Value of Lagging point L1 before L0 on X-axis time.

L1_Lead_Point: Value of Leading point L1 after L0 on X-axis time.

L2_Lag_Point: Value of Lagging point L2 before L0 on X-axis time.

L2_Lead_Point: Value of Leading point L2 after L0 on X-axis time.

L11: Distance between interpolation point L0 and Lagging point L1.

L12: Distance between interpolation point L0 and Leading point L1.

L21: Distance between interpolation point L0 and Lagging point L2.

L22: Distance between interpolation point L0 and Leading point L2.

Target: Value of L0.
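
For concreteness, here is one way such a frame could be assembled from a raw series. This is a hypothetical sketch: the inputs t (timestamps) and v (values), and the fixed one- and two-row offsets, are assumptions, not part of the original setup.

import pandas as pd

# Hypothetical sketch: assembling the Table 1 columns from timestamps t and values v.
def build_interpolation_frame(t: pd.Series, v: pd.Series) -> pd.DataFrame:
    return pd.DataFrame({
        'L1_Lag_Point': v.shift(1),    # nearest point before L0
        'L1_Lead_Point': v.shift(-1),  # nearest point after L0
        'L2_Lag_Point': v.shift(2),    # second point before L0
        'L2_Lead_Point': v.shift(-2),  # second point after L0
        'L11': t - t.shift(1),         # distance to Lagging point L1
        'L12': t.shift(-1) - t,        # distance to Leading point L1
        'L21': t - t.shift(2),         # distance to Lagging point L2
        'L22': t.shift(-2) - t,        # distance to Leading point L2
        'Target': v,                   # value of L0 itself
    })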

The distance features L11, L12, L21, and L22 are Contextual Features/Derived Features because they only exist in the context of the Lagging and Leading points L1 and L2. Can a machine learning algorithm understand that L11 is associated only with Lagging point L1, L12 with Leading point L1, L21 with Lagging point L2, and L22 with Leading point L2?

Table 2. Association between leading/lagging features and their respective distance features
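
Since Table 2 is a simple pairing, the association it depicts can also be written out directly as a plain mapping (added here only for readability):

# Each point feature pairs with exactly one distance feature (Table 2).
context_map = {
    'L1_Lag_Point': 'L11',
    'L1_Lead_Point': 'L12',
    'L2_Lag_Point': 'L21',
    'L2_Lead_Point': 'L22',
}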

2- The Answer: Yes

To test this, start with a simple exponential decay curve:

import numpy as np
import matplotlib.pyplot as plt

# Given values
time = 100 # days
initial = 100 # initial value
decay = 0.05 # decay rate

# Time array
t = np.linspace(0, time, 500) # using 500 points for a smooth curve

# Exponential decay function
value = initial * np.exp(-decay * t)

# Plotting
plt.figure(figsize=(10, 6))
plt.plot(t, value, label='Exponential Decay')
plt.title('Exponential Decay Function')
plt.xlabel('Time (days)')
plt.ylabel('Value')
plt.legend()
plt.grid(True)
plt.show()
Figure 2. A simple exponential decay function

For each point on the exponential, lag and lead values are taken at random shift values; i.e., a lag shift value of 1 means the point one step before, and a lead shift value of -1 means the point one step after.

import numpy as np
import pandas as pd
import random

# Generate x values and calculate y values (reusing `initial` and `decay` from the previous block)
x = np.linspace(0, 100, 1000)
y = initial * np.exp(-decay * x)

# Creating the DataFrame
df = pd.DataFrame({'x': x, 'y': y})

# Applying different random shift for each observation in y_lag and y_lead
df['lag_shift_value'] = [random.randint(100, 500) for _ in range(len(df))]
df['lead_shift_value'] = [random.randint(-500, -100) for _ in range(len(df))]

# Shifting y_lag and y_lead for each observation based on these random values
df['y_lag_random'] = [df['y'].shift(df['lag_shift_value'][i]).iloc[i] for i in range(len(df))]
df['y_lead_random'] = [df['y'].shift(df['lead_shift_value'][i]).iloc[i] for i in range(len(df))]


# Showing the first few rows of the DataFrame
df.head()
index    x         y           lag_shift_value    lead_shift_value    y_lag_random    y_lead_random
0        0.0000    100.0000    139                -253                NaN             28.1882
1        0.1001     99.5007    253                -157                NaN             45.3486
2        0.2002     99.0040    466                -102                NaN             59.4211
3        0.3003     98.5097    180                -255                NaN             27.4916
4        0.4004     98.0179    340                -389                NaN             13.9880
Table 3. The interpolation problem reformulated in a more generic way (values rounded to four decimals)

Next, we run a simple linear regression on the experiment setup in Table 3, using ‘lag_shift_value’, ‘lead_shift_value’, ‘y_lag_random’, and ‘y_lead_random’ as features and ‘y’ as the target.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Handling missing values (if any)
df = df.dropna(subset=['lag_shift_value', 'lead_shift_value', 'y_lag_random', 'y_lead_random', 'y'])

# Features and Target
X = df[['lag_shift_value', 'lead_shift_value', 'y_lag_random', 'y_lead_random']]
y = df['y']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating the model
model = LinearRegression()

# Training the model
model.fit(X_train, y_train)

# Predicting the target for test data
y_pred = model.predict(X_test)

# Evaluating the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")
Mean Squared Error: 10.615656109395115
R-squared: 0.9084496327931062

# Predict over the full dataset and plot predictions against actuals
df['y_pred'] = model.predict(X)
df.plot('x', ['y', 'y_pred'])
plt.show()
Figure 3. Regression results on interpolating data points with random shift values
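
As a further sanity check, one could drop the shift-value “context” features and refit. This ablation is not part of the original experiment; it is sketched here on the assumption that, if the model truly uses the shifts to contextualize the lag/lead values, the fit should degrade without them:

# Hypothetical ablation (not in the original experiment): refit without the
# shift-value context features and compare R-squared against the full model.
X_no_context = df[['y_lag_random', 'y_lead_random']]
Xn_train, Xn_test, yn_train, yn_test = train_test_split(
    X_no_context, y, test_size=0.2, random_state=42
)
model_no_context = LinearRegression().fit(Xn_train, yn_train)
y_pred_no_context = model_no_context.predict(Xn_test)
print(f"R-squared without context: {r2_score(yn_test, y_pred_no_context)}")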

--


Written by Emad Ezzeldin, Sr. Data Scientist @ UnitedHealthGroup

Five years as a Data Scientist, with an MSc in Data Analytics from George Mason University. I enjoy experimenting with data science tools. emad.ezzeldin4@gmail.com
