Double Machine Learning — Avocado Price Elasticity

--

Double Machine Learning (DML) is a Causal Inference technique for estimating the Average Treatment Effect of a Treatment on an Outcome variable in the presence of confounding. In this article, we explain the concept behind DML and demo a practical example on an avocado demand dataset, with Price as the Treatment and Sold Volume as the Outcome. This is an exercise in “Computational Causality”, where a causal relationship is drawn from historical data.
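In symbols, DML is usually presented through the partially linear model of Chernozhukov et al., where Y is the Outcome (Volume), T is the Treatment (Price), and X are the remaining market/time features; the sketch below is that standard formulation, not something specific to the avocado example:

Y = \theta T + g(X) + \varepsilon, \qquad T = m(X) + \nu

\hat{\theta} = \arg\min_{\theta} \sum_i \Big[ \big(Y_i - \hat{E}[Y \mid X_i]\big) - \theta \, \big(T_i - \hat{E}[T \mid X_i]\big) \Big]^2

In words: fit one ML model for the Expected Value of the Treatment given X, another for the Expected Value of the Outcome given X (not as a function of the Treatment), and regress the residuals of the second on the residuals of the first; the slope is the estimated treatment effect. That is exactly the structure of sections 3 and 4 below.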

1. Active Pricing Detection

1.1 Definition

1.2 Simplest Example: Stationary Price Trend

1.3 Pricing Patterns

1.4 Machine Learning

2. Expected Demand

2.1 Expected Demand but Not as a Function of Price

3. Machine Learning

3.1 Read Data and Process Data

3.2 ML for Price

3.3 ML for Volume

4. Machine Learning on residuals

4.1 Computing Residuals

4.2 Computing percentage deviation from the expected value

1. Active Pricing Detection

1.1 Definition

In statistics, standardizing a covariate (e.g., taking its Z-score) helps detect when a variable of interest deviates from its Expected Value (EV). We want to detect when prices differ from their EV, and we call that Active Pricing (AP). When a decision maker is conducting pricing experiments or trying out strategies, they are actively pricing. It is important to detect the effect of AP on Demand (the outcome variable of interest).

1.2 Simplest Example: Stationary Price Trend

In an experiment of, say, 100 days, the simplest case is a constant mean price over time; any deviation from that constant mean can be captured by a Z-score and considered AP. The Figure below shows conventional avocado sales data in Chicago; there, for instance, pricing decisions above $1.4 could be considered AP.
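As a minimal illustration (not from the original article), assuming a pandas Series of prices with a roughly constant mean, AP could be flagged with a simple Z-score threshold:

# Minimal sketch (illustration only): flag Active Pricing via a Z-score threshold
import pandas as pd

def flag_active_pricing(prices, z_threshold=1.96):
    z = (prices - prices.mean()) / prices.std()  # Z-score of each price
    return z.abs() > z_threshold                 # True where the price deviates strongly from its EV

prices = pd.Series([1.00, 1.05, 0.98, 1.02, 1.45, 1.01])  # toy example
print(flag_active_pricing(prices))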

1.3 Pricing Patterns

Prices can also follow time-dependent patterns: non-stationary trends, seasonality, and cycles. Identifying Active vs. Passive Pricing behavior then becomes a little more difficult, but it is very possible with simple time series analysis. For example, de-trending (subtracting the fitted trend from the prices) is one recommended first step. Then, for seasonal patterns, a simple linear regression of the de-trended prices on one-hot-encoded seasons as features can do. Finally, for cyclical patterns, adding features like sin(time) and cos(time) to a simple regressor can work too, as sketched below.
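Here is a minimal sketch of those three steps (illustration only; the column names price and date are assumptions, not the article's dataset):

# Minimal sketch (illustration only): de-trend, then model seasonal and cyclical patterns
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def expected_price_over_time(df):
    # df is assumed to have a datetime 'date' column and a numeric 'price' column
    t = np.arange(len(df)).reshape(-1, 1)

    # 1) De-trending: fit a linear trend and subtract it from the prices
    trend_fit = LinearRegression().fit(t, df["price"])
    detrended = df["price"] - trend_fit.predict(t)

    # 2) Seasonality: one-hot-encoded quarters; 3) cycles: sin/cos of time (weekly data, yearly cycle assumed)
    features = pd.get_dummies(df["date"].dt.quarter, prefix="q").astype(float)
    features["sin_t"] = np.sin(2 * np.pi * t.ravel() / 52)
    features["cos_t"] = np.cos(2 * np.pi * t.ravel() / 52)
    season_cycle_fit = LinearRegression().fit(features, detrended)

    # Expected price = trend + seasonal/cyclical component; large residuals are candidate AP
    return pd.Series(trend_fit.predict(t) + season_cycle_fit.predict(features), index=df.index)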

1.4 Machine Learning

In more complex cases, the price is a function of many variables and is determined by complex patterns (e.g., inflation, competition, discount seasons such as Black Friday, etc.). These patterns can be captured by a machine learning algorithm, which can therefore be used to estimate an “Expected Value” for price.

2. Expected Demand

2.1 Expected Demand but Not as a Function of Price

Much like price, demand is influenced by many market variables other than price and exhibits complex patterns such as non-stationary trends, seasonality, and cycles. Machine Learning can also be used to estimate an Expected Value for demand, explicitly not as a function of price.

3. Machine Learning

3.1 Read Data and Process Data

# Source: https://www.kaggle.com/datasets/timmate/avocado-prices-2020?resource=download
# 1- Read Data
import pandas as pd

path = ''
filename = 'avocado-updated-2020.csv'
datafile = path + filename
data = pd.read_csv(datafile)


# 2- Feature Selection
print(data.columns)
selected_columns = ["date", "average_price", "total_volume", "type", "geography"]
data = data[selected_columns]


# 3- Data Preprocessing
print(list(set(data.geography.values)))  # see unique values of geography
c = data["geography"] == "Total U.S."
data = data[c]  # filter data for Total U.S. only

for col in data.columns:
    print(col, data[col].dtype)  # explore column dtypes

data["date"] = pd.to_datetime(data["date"])                   # change date column from str to datetime
data["total_volume"] = pd.to_numeric(data["total_volume"])    # change column type to numeric
data["average_price"] = pd.to_numeric(data["average_price"])  # change column type to numeric

data = data.sort_values("date")  # sort dataframe by date

data = data.join(pd.get_dummies(data["type"]))  # one-hot-encode the 'type' column

def add_continous_time(df):  # add a continuous time index
    df["time_continous"] = df.reset_index().index
    return df

data = data.groupby("type").apply(add_continous_time)  # continuous time index per type

drop_cols = ["type", "geography", "date", "organic"]  # columns no longer needed
data = data.drop(drop_cols, axis=1)


print(data)
average_price total_volume conventional time_continous
102 0.95 31324277.73 1 0
103 1.46 612910.15 0 0
210 1.01 29063542.75 1 1
211 1.42 669528.88 0 1
318 1.03 29043458.85 1 2
... ... ... ... ...
32824 1.45 1816438.84 0 303
32931 0.89 41214193.33 1 304
32932 1.44 1739237.47 0 304
33039 0.89 36303871.76 1 305
33040 1.47 1583056.27 0 305

[612 rows x 4 columns]
# plotting average price over time, colored by type
data.plot.scatter("time_continous", "average_price", c="conventional", colormap='viridis')

3.2 ML for Price

X_cols = ["time_continous", "conventional"]
y_col = "average_price"
X = data[X_cols]
y = data[y_col]

import numpy as np
from sklearn.model_selection import train_test_split
train_features, test_features, train_labels, test_labels = train_test_split(X, y, test_size=0.33, random_state=42)

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
rf = RandomForestRegressor(random_state=42)

from pprint import pprint
# Look at the parameters used by our current forest
print('Parameters currently in use:\n')
pprint(rf.get_params())

Parameters currently in use:

{'bootstrap': True,
'ccp_alpha': 0.0,
'criterion': 'squared_error',
'max_depth': None,
'max_features': 1.0,
'max_leaf_nodes': None,
'max_samples': None,
'min_impurity_decrease': 0.0,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'n_estimators': 100,
'n_jobs': None,
'oob_score': False,
'random_state': 42,
'verbose': 0,
'warm_start': False}


# Number of trees in the random forest
n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in a tree
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

pprint(random_grid)
{'bootstrap': [True, False],
'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None],
'max_features': ['auto', 'sqrt'],
'min_samples_leaf': [1, 2, 4],
'min_samples_split': [2, 5, 10],
'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}


# Use the random grid to search for the best hyperparameters
# First create the base model to tune
rf = RandomForestRegressor()
# Random search of parameters, using 3-fold cross validation,
# searching across 100 different combinations, and using all available cores
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid, n_iter=100, cv=3, verbose=2, random_state=42, n_jobs=-1)
# Fit the random search model
rf_random.fit(train_features, train_labels)

pprint(rf_random.best_params_)
{'n_estimators': 200,
'min_samples_split': 2,
'min_samples_leaf': 1,
'max_features': 'sqrt',
'max_depth': 50,
'bootstrap': True}

def evaluate(model, test_features, test_labels):
    predictions = model.predict(test_features)
    errors = abs(predictions - test_labels)
    mape = 100 * np.mean(errors / test_labels)
    accuracy = 100 - mape
    print('Model Performance')
    print('Average Error: {:0.4f}.'.format(np.mean(errors)))
    print('Accuracy = {:0.2f}%.'.format(accuracy))
    return accuracy

base_model = RandomForestRegressor(n_estimators=10, random_state=42)
base_model.fit(train_features, train_labels)
base_accuracy = evaluate(base_model, test_features, test_labels)

Model Performance
Average Error: 0.0458.
Accuracy = 96.46%.

best_random = rf_random.best_estimator_
random_accuracy = evaluate(best_random, test_features, test_labels)

Model Performance
Average Error: 0.0431.
Accuracy = 96.59%.

print('Improvement of {:0.2f}%.'.format(100 * (random_accuracy - base_accuracy) / base_accuracy))

Improvement of 0.14%.

The improvement from hyperparameter tuning is very small, so it is reasonable to just use the algorithm's default parameters. Also, the mean absolute percentage error is about 4%, which is low enough for our purposes.

data ["price_pred"] = base_model.predict (X)

data
average_price total_volume conventional time_continous price_pred
102 0.95 31324277.73 1 0 1.002
103 1.46 612910.15 0 0 1.448
210 1.01 29063542.75 1 1 1.002
211 1.42 669528.88 0 1 1.432
318 1.03 29043458.85 1 2 1.002
... ... ... ... ... ...
32824 1.45 1816438.84 0 303 1.416
32931 0.89 41214193.33 1 304 0.911
32932 1.44 1739237.47 0 304 1.416
33039 0.89 36303871.76 1 305 0.911
33040 1.47 1583056.27 0 305 1.416

[612 rows x 5 columns]

data.plot.scatter ("price_pred" , "average_price")

3.3 ML for Volume

The same ML pipeline used for price is applied to volume, achieving a mean absolute percentage error of about 8%. A minimal sketch of this step is shown below, followed by the resulting dataframe.
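The article does not show this code, so here is a minimal sketch of how the volume model could look (illustration only; it reuses the small RandomForestRegressor from above on the full data rather than repeating the tuning loop):

# Minimal sketch (illustration only): Expected Value of volume NOT as a function of price
vol_model = RandomForestRegressor(n_estimators=10, random_state=42)
vol_model.fit(data[X_cols], data["total_volume"])       # same features as the price model: time and type
data["volume_pred"] = vol_model.predict(data[X_cols])   # expected volume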

pprint(data)
average_price total_volume conventional time_continous price_pred \
102 0.95 31324277.73 1 0 1.002
103 1.46 612910.15 0 0 1.448
210 1.01 29063542.75 1 1 1.002
211 1.42 669528.88 0 1 1.432
318 1.03 29043458.85 1 2 1.002
... ... ... ... ... ...
32824 1.45 1816438.84 0 303 1.416
32931 0.89 41214193.33 1 304 0.911
32932 1.44 1739237.47 0 304 1.416
33039 0.89 36303871.76 1 305 0.911
33040 1.47 1583056.27 0 305 1.416

volume_pred
102 3.175348e+07
103 6.342549e+05
210 3.175348e+07
211 6.569024e+05
318 3.175348e+07
... ...
32824 1.934104e+06
32931 4.476763e+07
32932 1.934104e+06
33039 4.476763e+07
33040 1.934104e+06

4. Machine Learning on residuals

4.1 Computing Residuals

data ["price_residuals"]  = data ["average_price"] - data ["price_pred"]
data ["volume_residuals"] = data ["total_volume"] - data ["volume_pred"]
print (data)
average_price total_volume conventional time_continous price_pred \
102 0.95 31324277.73 1 0 1.002
103 1.46 612910.15 0 0 1.448
210 1.01 29063542.75 1 1 1.002
211 1.42 669528.88 0 1 1.432
318 1.03 29043458.85 1 2 1.002
... ... ... ... ... ...
32824 1.45 1816438.84 0 303 1.416
32931 0.89 41214193.33 1 304 0.911
32932 1.44 1739237.47 0 304 1.416
33039 0.89 36303871.76 1 305 0.911
33040 1.47 1583056.27 0 305 1.416

volume_pred price_residuals volume_residuals
102 3.175348e+07 -0.052 -429206.484
103 6.342549e+05 0.012 -21344.731
210 3.175348e+07 0.008 -2689941.464
211 6.569024e+05 -0.012 12626.507
318 3.175348e+07 0.028 -2710025.364
... ... ... ...
32824 1.934104e+06 0.034 -117665.430
32931 4.476763e+07 -0.021 -3553441.380
32932 1.934104e+06 0.024 -194866.800
33039 4.476763e+07 -0.021 -8463762.950
33040 1.934104e+06 0.054 -351048.000

4.2 Computing percentage deviation from the expected value

data ["price_pdfev"]  = data ["price_residuals"] / data ["price_pred"]
data ["volume_pdfev"] = data ["volume_residuals"] / data ["volume_pred"]
print (data)

average_price total_volume conventional time_continous price_pred \
102 0.95 31324277.73 1 0 1.002
103 1.46 612910.15 0 0 1.448
210 1.01 29063542.75 1 1 1.002
211 1.42 669528.88 0 1 1.432
318 1.03 29043458.85 1 2 1.002
... ... ... ... ... ...
32824 1.45 1816438.84 0 303 1.416
32931 0.89 41214193.33 1 304 0.911
32932 1.44 1739237.47 0 304 1.416
33039 0.89 36303871.76 1 305 0.911
33040 1.47 1583056.27 0 305 1.416

volume_pred price_residuals volume_residuals price_pdfev \
102 3.175348e+07 -0.052 -429206.484 -0.051896
103 6.342549e+05 0.012 -21344.731 0.008287
210 3.175348e+07 0.008 -2689941.464 0.007984
211 6.569024e+05 -0.012 12626.507 -0.008380
318 3.175348e+07 0.028 -2710025.364 0.027944
... ... ... ... ...
32824 1.934104e+06 0.034 -117665.430 0.024011
32931 4.476763e+07 -0.021 -3553441.380 -0.023052
32932 1.934104e+06 0.024 -194866.800 0.016949
33039 4.476763e+07 -0.021 -8463762.950 -0.023052
33040 1.934104e+06 0.054 -351048.000 0.038136

volume_pdfev
102 -0.013517
103 -0.033653
210 -0.084713
211 0.019221
318 -0.085346
... ...
32824 -0.060837
32931 -0.079375
32932 -0.100753
33039 -0.189060
33040 -0.181504

[612 rows x 10 columns]

data.plot.scatter ("price_pdfev" , "volume_pdfev")

--

Emad Ezzeldin, Sr. Data Scientist @ UnitedHealthGroup

Data Scientist with 5 years of experience and an MSc in Data Analytics from George Mason University. I enjoy experimenting with Data Science tools. emad.ezzeldin4@gmail.com