Introduction to Time Series Cross-Validation: Mastering Predictive Accuracy in Sequential Data
Introduction
Traditional cross-validation methods, widely used in various machine learning scenarios, do not suffice when it comes to time series data. The reason? Time series data is inherently sequential and often possesses temporal dependencies, meaning that the order of data points is crucial. Simply put, time matters. Randomly splitting this type of data, as done in standard cross-validation, disrupts these temporal relationships, leading to models that fail to capture the essence of time-dependent patterns.
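To see why order matters, here is a minimal sketch contrasting a shuffled K-fold split with an order-preserving split on ten sequential observations, using scikit-learn's built-in KFold and TimeSeriesSplit (we will build our own splitter below, so this is purely illustrative and the variable names are ours): the shuffled split freely places later observations in the training fold, which amounts to training on the future.
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit
# Ten observations in time order, indexed 0..9
X_demo = np.arange(10).reshape(-1, 1)
# Shuffled K-fold: training indices can come after the test indices in time
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_demo):
    print("KFold           train:", train_idx, "test:", test_idx)
# TimeSeriesSplit: the training set always precedes the test set
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X_demo):
    print("TimeSeriesSplit train:", train_idx, "test:", test_idx)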
1- Time Series Cross-Validation
In the previous section, we introduced the concept of Time Series Cross-Validation (tsCV) and discussed its importance in the field of time series forecasting. Now, let’s dive into a practical example to demonstrate how tsCV can be implemented and utilized in a real-world scenario.
To begin, we’ll create a synthetic dataset using Python’s pandas and numpy libraries. This dataset will simulate a scenario where we have a single feature X and a target variable y, which are linearly related. The simplicity of this example allows us to focus on the mechanics of tsCV without the added complexity of real-world data.
In our initial exploration of Time Series Cross-Validation (tsCV), we’ll start with a straightforward approach that does not incorporate gaps between the training and testing sets. This method is particularly useful for understanding the fundamental mechanics of tsCV and serves as a baseline for more complex variations, such as introducing gaps, which we’ll cover later.
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
# Generating a dataset where X and y are linearly dependent
np.random.seed(42)
size = 100
X_values = np.linspace(0, 10, size)
y_values = 3 * X_values + np.random.normal(0, 2, size)
# Creating a DataFrame
data = pd.DataFrame({'X': X_values, 'y': y_values})
# Function for time series cross-validation without gaps
def time_series_cv(data, n_splits):
    n_samples = len(data)
    fold_size = n_samples // n_splits
    for i in range(n_splits):
        test_start = i * fold_size
        test_end = test_start + fold_size if i < n_splits - 1 else n_samples
        train = data[:test_start]
        test = data[test_start:test_end]
        yield train, test
# Linear regression model
model = LinearRegression()
# DataFrame for storing results
cv_results_df = pd.DataFrame(columns=['X', 'y', 'y_pred'])
n_splits = 5
# Applying time series cross-validation
for train_index, test_index in time_series_cv(data.index, n_splits):
    X_train, X_test = data.loc[train_index, 'X'].values.reshape(-1, 1), data.loc[test_index, 'X'].values.reshape(-1, 1)
    y_train, y_test = data.loc[train_index, 'y'].values, data.loc[test_index, 'y'].values
    # Fit the model and predict
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Append results to DataFrame
    fold_results = pd.DataFrame({
        'X': X_test.squeeze(),
        'y': y_test,
        'y_pred': y_pred
    })
    cv_results_df = pd.concat([cv_results_df, fold_results], ignore_index=True)
cv_results_df.head()
ValueError: Found array with 0 sample(s) (shape=(0, 1)) while a minimum of 1 is required by LinearRegression.
The previous code throws this ValueError because, for the first fold, the training set is empty: the test set starts at the very beginning of the data, leaving no data points for training. To fix this, adjust the cross-validation function so that there is always at least one data point in the training set.
# Adjusted function for time series cross-validation without gaps
def time_series_cv_adjusted(data, n_splits):
    n_samples = len(data)
    fold_size = n_samples // n_splits
    for i in range(n_splits):
        test_start = i * fold_size
        if test_start == 0:  # Ensure at least one sample in the train set
            continue
        test_end = test_start + fold_size if i < n_splits - 1 else n_samples
        train = data[:test_start]
        test = data[test_start:test_end]
        yield train, test
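Re-running the earlier fitting loop with the adjusted splitter populates cv_results_df; note that the first fold is skipped entirely, so only four folds contribute predictions.
# Re-applying the cross-validation loop with the adjusted splitter
cv_results_df = pd.DataFrame(columns=['X', 'y', 'y_pred'])
for train_index, test_index in time_series_cv_adjusted(data.index, n_splits):
    X_train, X_test = data.loc[train_index, 'X'].values.reshape(-1, 1), data.loc[test_index, 'X'].values.reshape(-1, 1)
    y_train, y_test = data.loc[train_index, 'y'].values, data.loc[test_index, 'y'].values
    # Fit on all data preceding the test block, then predict the block
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    fold_results = pd.DataFrame({'X': X_test.squeeze(), 'y': y_test, 'y_pred': y_pred})
    cv_results_df = pd.concat([cv_results_df, fold_results], ignore_index=True)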
print(cv_results_df.head())
X y y_pred
0 2.020202 8.991904 3.627149
1 2.121212 5.912084 3.731050
2 2.222222 6.801723 3.834951
3 2.323232 4.120201 3.938851
4 2.424242 6.183962 4.042752
import matplotlib.pyplot as plt
# Plotting X with y and y_pred
plt.figure(figsize=(12, 6))
plt.plot(cv_results_df['X'], cv_results_df['y'], label='Actual y', color='blue', marker='o')
plt.plot(cv_results_df['X'], cv_results_df['y_pred'], label='Predicted y', color='red', linestyle='--')
plt.title('Actual vs Predicted y')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.grid(True)
plt.show()
# Producing a series of plots to show the cross-validation process
fig, axes = plt.subplots(n_splits, 1, figsize=(12, 2 * n_splits))
for i, (train_index, test_index) in enumerate(time_series_cv_adjusted(data.index, n_splits)):
    X_train, X_test = data.loc[train_index, 'X'].values.reshape(-1, 1), data.loc[test_index, 'X'].values.reshape(-1, 1)
    y_train, y_test = data.loc[train_index, 'y'].values, data.loc[test_index, 'y'].values
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Plotting
    ax = axes[i]
    ax.plot(data['X'], data['y'], label='Full Data', color='grey', alpha=0.3)
    ax.scatter(X_train, y_train, color='blue', label='Train Data')
    ax.scatter(X_test, y_test, color='green', label='Test Data')
    ax.plot(X_test, y_pred, color='red', label='Predicted on Test', linestyle='--')
    ax.set_title(f"Fold {i + 1}")
    ax.legend()
plt.tight_layout()
plt.show()
2- Time Series Cross-Validation with Gaps
# Function for time series cross-validation with gap adjusted
def time_series_cv_with_gap_adjusted(data, n_splits, gap=0):
    n_samples = len(data)
    fold_size = n_samples // n_splits
    for i in range(n_splits):
        test_start = i * fold_size + gap
        if test_start >= n_samples:  # Skip if test_start is beyond the data length
            continue
        test_end = test_start + fold_size if i < n_splits - 1 else n_samples
        train = data[:max(1, test_start - gap)]  # Ensure at least one sample in train
        test = data[test_start:test_end]
        yield train, test
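The gap parameter leaves out a block of observations between the end of the training set and the start of the test set, which mimics a forecasting horizon and guards against leakage from points immediately adjacent to the test window. Here is a minimal sketch of how this splitter can be plugged into the same fitting loop; the gap of 2 samples and the cv_results_gap_df name are arbitrary choices for illustration.
# Sketch: reusing the fitting loop with a gap of 2 samples between train and test
cv_results_gap_df = pd.DataFrame(columns=['X', 'y', 'y_pred'])
for train_index, test_index in time_series_cv_with_gap_adjusted(data.index, n_splits, gap=2):
    X_train, X_test = data.loc[train_index, 'X'].values.reshape(-1, 1), data.loc[test_index, 'X'].values.reshape(-1, 1)
    y_train, y_test = data.loc[train_index, 'y'].values, data.loc[test_index, 'y'].values
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    fold_results = pd.DataFrame({'X': X_test.squeeze(), 'y': y_test, 'y_pred': y_pred})
    cv_results_gap_df = pd.concat([cv_results_gap_df, fold_results], ignore_index=True)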
3- Time Series Cross-Validation with Lagged Features
In our ongoing exploration of Time Series Cross-Validation (tsCV), we now turn to an advanced technique that significantly enhances the model’s ability to capture temporal dynamics: the inclusion of lagged features. This approach is particularly beneficial in scenarios where past observations are predictive of future ones, a common characteristic in many time series datasets.
Lagged features are essentially previous time steps of a variable, used as additional predictors in the model. For example, if you are predicting a daily stock price, yesterday’s price (lag 1), the price from two days ago (lag 2), and so on, could be valuable predictors for today’s price.
Incorporating these lagged features allows the model to recognize patterns and dependencies over time, leading to more accurate and robust predictions, especially in time series data where autocorrelation is present.
def create_lagged_features(df, n_lags=1):
    """
    Create lagged features for time series data.

    Parameters:
        df (pd.DataFrame): Original DataFrame with a 'y' column.
        n_lags (int): Number of lagged features to create.

    Returns:
        pd.DataFrame: DataFrame with lagged features.
    """
    for lag in range(1, n_lags + 1):
        df[f'y_lag_{lag}'] = df['y'].shift(lag)
    return df
# Adding lagged features to the dataset
n_lags = 3 # Number of lagged features
data_with_lags = create_lagged_features(data.copy(), n_lags)
# Dropping rows with NaN values that were created due to lagging
data_with_lags.dropna(inplace=True)
data_with_lags.head() # Displaying the first few rows with lagged features
X y y_lag_1 y_lag_2 y_lag_3
3 0.303030 3.955151 1.901438 0.026502 0.993428
4 0.404040 0.743814 3.955151 1.901438 0.026502
5 0.505051 1.046878 0.743814 3.955151 1.901438
6 0.606061 4.976607 1.046878 0.743814 3.955151
7 0.707071 3.656082 4.976607 1.046878 0.743814
# Defining a new time series cross-validation function using the DataFrame with lagged features
# Adjusting the function to work with multiple feature columns
def time_series_cv_with_lags(data, n_splits, gap=0):
    n_samples = len(data)
    fold_size = n_samples // n_splits
    for i in range(n_splits):
        test_start = i * fold_size + gap
        if test_start >= n_samples:  # Skip if test_start is beyond the data length
            continue
        test_end = test_start + fold_size if i < n_splits - 1 else n_samples
        train = data.iloc[:max(1, test_start - gap)]  # Ensure at least one sample in train
        test = data.iloc[test_start:test_end]
        yield train, test
# Applying time series cross-validation with the new dataset
cv_results_lags_df = pd.DataFrame(columns=['X', 'y', 'y_pred'])
feature_cols = ['X', 'y_lag_1', 'y_lag_2', 'y_lag_3']
target_col = 'y'
gap = 1  # Gap (in samples) between the end of the training set and the start of the test set
for train, test in time_series_cv_with_lags(data_with_lags, n_splits, gap):
    X_train = train[feature_cols]
    y_train = train[target_col]
    X_test = test[feature_cols]
    y_test = test[target_col]
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Append results to DataFrame
    test_with_pred = test.copy()
    test_with_pred['y_pred'] = y_pred
    cv_results_lags_df = pd.concat([cv_results_lags_df, test_with_pred[['X', 'y', 'y_pred']]], ignore_index=True)
cv_results_lags_df.head()  # Displaying the first few rows of the results DataFrame
X y y_pred
0 0.404040 0.743814 3.955151
1 0.505051 1.046878 3.955151
2 0.606061 4.976607 3.955151
3 0.707071 3.656082 3.955151
4 0.808081 1.485294 3.955151
# Producing a series of plots to show the cross-validation process with lagged features
fig, axes = plt.subplots(n_splits, 1, figsize=(12, 2 * n_splits))
for i, (train, test) in enumerate(time_series_cv_with_lags(data_with_lags, n_splits, gap)):
    X_train = train[feature_cols]
    y_train = train[target_col]
    X_test = test[feature_cols]
    y_test = test[target_col]
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Plotting
    ax = axes[i]
    ax.plot(data_with_lags['X'], data_with_lags['y'], label='Full Data', color='grey', alpha=0.3)
    ax.scatter(X_train['X'], y_train, color='blue', label='Train Data')
    ax.scatter(X_test['X'], y_test, color='green', label='Test Data')
    ax.plot(X_test['X'], y_pred, color='red', label='Predicted on Test', linestyle='--')
    ax.set_title(f"Fold {i + 1}")
    ax.legend()
plt.tight_layout()
plt.show()
You can see that the linear regression model fits the data more closely once lagged features are included.
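One way to back this visual impression with a number is to compare an error metric over the pooled out-of-fold predictions from the two runs. Below is a rough sketch using scikit-learn's mean_squared_error; keep in mind that the fold layouts and sample counts differ slightly between the two setups, so the comparison is indicative rather than exact.
from sklearn.metrics import mean_squared_error
# Pooled out-of-fold error without lagged features vs. with lagged features
# (astype(float) guards against object dtype from the initial empty DataFrames)
mse_plain = mean_squared_error(cv_results_df['y'].astype(float), cv_results_df['y_pred'].astype(float))
mse_lags = mean_squared_error(cv_results_lags_df['y'].astype(float), cv_results_lags_df['y_pred'].astype(float))
print(f"MSE without lags: {mse_plain:.3f}")
print(f"MSE with lags:    {mse_lags:.3f}")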
4- Adding an Offset
You can also add an initial offset so that the cross-validation starts with enough training data in the first fold.
# Adjusting the time series cross-validation function to include an offset
def time_series_cv_with_lags_and_offset(data, n_splits, gap=0, offset=0):
    """
    Perform time series cross-validation with lagged features and an offset.

    Parameters:
        data (pd.DataFrame): DataFrame containing the features and target.
        n_splits (int): Number of splits/folds for cross-validation.
        gap (int): Gap between train and test sets.
        offset (int): Offset to start the cross-validation.

    Yields:
        train (pd.DataFrame): Training set for the current split.
        test (pd.DataFrame): Testing set for the current split.
    """
    n_samples = len(data)
    fold_size = (n_samples - offset) // n_splits
    for i in range(n_splits):
        test_start = offset + i * fold_size + gap
        if test_start >= n_samples:  # Skip if test_start is beyond the data length
            continue
        test_end = test_start + fold_size if i < n_splits - 1 else n_samples
        train = data.iloc[:max(offset, test_start - gap)]  # Ensure at least some samples in train
        test = data.iloc[test_start:test_end]
        yield train, test
# Define the offset
offset = 10 # For example
# Applying time series cross-validation with the new dataset and offset
cv_results_lags_offset_df = pd.DataFrame(columns=['X', 'y', 'y_pred'])
# Creating plots for each fold
fig, axes = plt.subplots(n_splits, 1, figsize=(12, 2 * n_splits))
for i, (train, test) in enumerate(time_series_cv_with_lags_and_offset(data_with_lags, n_splits, gap, offset)):
    X_train = train[feature_cols]
    y_train = train[target_col]
    X_test = test[feature_cols]
    y_test = test[target_col]
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    # Append results to DataFrame
    test_with_pred = test.copy()
    test_with_pred['y_pred'] = y_pred
    cv_results_lags_offset_df = pd.concat([cv_results_lags_offset_df, test_with_pred[['X', 'y', 'y_pred']]], ignore_index=True)
    # Plotting
    ax = axes[i]
    ax.plot(data_with_lags['X'], data_with_lags['y'], label='Full Data', color='grey', alpha=0.3)
    ax.scatter(X_train['X'], y_train, color='blue', label='Train Data')
    ax.scatter(X_test['X'], y_test, color='green', label='Test Data')
    ax.plot(X_test['X'], y_pred, color='red', label='Predicted on Test', linestyle='--')
    ax.set_title(f"Fold {i + 1} (Offset: {offset})")
    ax.legend()
plt.tight_layout()
plt.show()
5- Conclusion
We demoed several implementations of the time series cross-validation function: 1- a naive split (which fails when the first training set is empty), 2- an adjusted version that guarantees at least one training sample, 3- a version with a gap between the training and testing sets, 4- one with lagged features, and finally 5- one with an initial offset. I hope you found this a good introduction to the topic. Please leave comments and claps if you found it informative. Thanks!