Can Machine Learning Algorithms Process Contextual Features for Regression?!
1- Introduction: The Question
Many real-world ML problems include “Contextual Features”: features that count as features only when flagged as important by a “Context Flag”. Furthermore, most ML feature sets interact in some capacity (multiplication, division, powers, etc.). Take Figure 1, showing point interpolation, where point L0 is interpolated using points L1 and L2 and the distances L11, L12, L21, and L22.
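As a toy illustration (with hypothetical column names invented for this example), consider a discount_pct column that is only meaningful when a has_discount context flag is set; the informative signal is the interaction of the two, not either column alone:

import pandas as pd

# Hypothetical example: discount_pct is a contextual feature; it only
# carries meaning when the context flag has_discount == 1.
toy = pd.DataFrame({
    'has_discount': [1, 0, 1, 0],    # context flag
    'discount_pct': [10, 37, 25, 4], # meaningful only when has_discount == 1
})
# The informative signal is the interaction, not discount_pct alone.
toy['effective_discount'] = toy['has_discount'] * toy['discount_pct']
print(toy)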
Here is what a machine learning experiment dataframe would look like:
L1_Lag_Point: Value of lagging point L1 before L0 on the X (time) axis.
L1_Lead_Point: Value of leading point L1 after L0 on the X (time) axis.
L2_Lag_Point: Value of lagging point L2 before L0 on the X (time) axis.
L2_Lead_Point: Value of leading point L2 after L0 on the X (time) axis.
L11: Distance between interpolation point L0 and lagging point L1.
L12: Distance between interpolation point L0 and leading point L1.
L21: Distance between interpolation point L0 and lagging point L2.
L22: Distance between interpolation point L0 and leading point L2.
Target: Value of L0.
The distance features L11, L12, L21, and L22 are contextual/derived features because they exist only in the context of the lagging and leading points L1 and L2. Can a machine learning algorithm understand that L11 is associated only with lagging point L1, L12 with leading point L1, L21 with lagging point L2, and L22 with leading point L2?
2- The Answer: Yes
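To test this, start by generating a synthetic signal: an exponential decay curve over 100 days.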
import numpy as np
import matplotlib.pyplot as plt
# Given values
time = 100 # days
initial = 100 # initial value
decay = 0.05 # decay rate
# Time array
t = np.linspace(0, time, 500) # using 500 points for a smooth curve
# Exponential decay function
value = initial * np.exp(-decay * t)
# Plotting
plt.figure(figsize=(10, 6))
plt.plot(t, value, label='Exponential Decay')
plt.title('Exponential Decay Function')
plt.xlabel('Time (days)')
plt.ylabel('Value')
plt.legend()
plt.grid(True)
plt.show()
For each point on the exponential curve, take lag and lead values at random shift offsets; i.e., a lag shift value of 1 means the point before, and a lead shift value of -1 means the point after.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.spatial.distance import euclidean
import random
# Reuse initial and decay from the previous block
# Generate x values and calculate y values
x = np.linspace(0, 100, 1000)
y = initial * np.exp(-decay * x)
# Creating the DataFrame
df = pd.DataFrame({'x': x, 'y': y})
# Applying different random shift for each observation in y_lag and y_lead
df['lag_shift_value'] = [random.randint(100, 500) for _ in range(len(df))]
df['lead_shift_value'] = [random.randint(-500, -100) for _ in range(len(df))]
# Shifting y_lag and y_lead for each observation based on these random values
df['y_lag_random'] = [df['y'].shift(df['lag_shift_value'][i]).iloc[i] for i in range(len(df))]
df['y_lead_random'] = [df['y'].shift(df['lead_shift_value'][i]).iloc[i] for i in range(len(df))]
# Recalculating Euclidean distance for each observation
# df['y_y_lag_random_distance_2d'] = [euclidean([df['x'].iloc[i], df['y'].iloc[i]], [df['x'].iloc[i], df['y_lag_random'].iloc[i]]) if not np.isnan(df['y_lag_random'].iloc[i]) else np.nan for i in range(len(df))]
# df['y_y_lead_random_distance_2d'] = [euclidean([df['x'].iloc[i], df['y'].iloc[i]], [df['x'].iloc[i], df['y_lead_random'].iloc[i]]) if not np.isnan(df['y_lead_random'].iloc[i]) else np.nan for i in range(len(df))]
# Showing the first few rows of the DataFrame
df.head()
index,x,y,lag_shift_value,lead_shift_value,y_lag_random,y_lead_random
0,0.0,100.0,139,-253,NaN,28.188213222084123
1,0.1001001001001001,99.50074991627065,253,-157,NaN,45.348604086633785
2,0.2002002002002002,99.00399233900234,466,-102,NaN,59.421116835131016
3,0.3003003003003003,98.50971482435446,180,-255,NaN,27.491555772387176
4,0.4004004004004004,98.01790499061231,340,-389,NaN,13.988047556762929
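Table 3: First rows of the experiment DataFrame. y_lag_random is NaN where the lag shift reaches past the start of the series.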
Next, run a simple linear regression on this experiment setup (Table 3), using ‘lag_shift_value’, ‘lead_shift_value’, ‘y_lag_random’, and ‘y_lead_random’ as features and ‘y’ as the target.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# Handling missing values (if any)
df = df.dropna(subset=['lag_shift_value', 'lead_shift_value', 'y_lag_random', 'y_lead_random', 'y'])
# Features and Target
X = df[['lag_shift_value', 'lead_shift_value', 'y_lag_random', 'y_lead_random']]
y = df['y']
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Creating the model
model = LinearRegression()
# Training the model
model.fit(X_train, y_train)
# Predicting the target for test data
y_pred = model.predict(X_test)
# Evaluating the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")
Mean Squared Error: 10.615656109395115
R-squared: 0.9084496327931062
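An R-squared of about 0.91 means the linear model recovers most of the variance in y from the shifted values and their shift offsets, supporting the answer above. Finally, predict over the full dataset and plot the fit: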
df['y_pred'] = model.predict(X)
df.plot(x='x', y=['y', 'y_pred'])
plt.show()
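As a further check (a minimal sketch, not part of the original experiment): the introduction noted that feature sets are often interactive, and a tree-based model can capture the interaction between a shift offset and its shifted value, which a linear model cannot represent directly. Assuming the same X_train, X_test, y_train, and y_test from the split above:

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Assumes X_train, X_test, y_train, y_test from the split above.
gbr = GradientBoostingRegressor(random_state=42)  # nonlinear, interaction-aware model
gbr.fit(X_train, y_train)

y_pred_gbr = gbr.predict(X_test)
print(f"GBR Mean Squared Error: {mean_squared_error(y_test, y_pred_gbr)}")
print(f"GBR R-squared: {r2_score(y_test, y_pred_gbr)}")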