Double Machine Learning — Avocado Price Elasticity

--

Double Machine Learning (DML) is a Causal Inference technique for estimating the Average Treatment Effect of a Treatment on an Outcome variable in the presence of confounding. In this article, we explain the concept behind DML and demo a practical example on an avocado demand dataset, with Price as the Treatment and Sold Volume as the Outcome. This is an exercise in “Computational Causality”, where a causal relationship is drawn from historical data.
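In symbols, DML is usually presented through the partially linear model of Chernozhukov et al., where Y is the Outcome (Volume), T is the Treatment (Price), and X are the remaining market/time features; the sketch below is that standard formulation, not something specific to the avocado example:

Y = \theta T + g(X) + \varepsilon, \qquad T = m(X) + \nu

\hat{\theta} = \arg\min_{\theta} \sum_i \Big[ \big(Y_i - \hat{E}[Y \mid X_i]\big) - \theta \, \big(T_i - \hat{E}[T \mid X_i]\big) \Big]^2

In words: fit one ML model for the Expected Value of the Treatment given X, another for the Expected Value of the Outcome given X (not as a function of the Treatment), and regress the residuals of the second on the residuals of the first; the slope is the estimated treatment effect. That is exactly the structure of sections 3 and 4 below.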

1. Active Pricing Detection

1.1 Definition

1.2 Simplest Example: Stationary Price Trend

1.3 Pricing Patterns

1.4 Machine Learning

2. Expected Demand

2.1 Expected Demand but Not as a Function of Price

3. Machine Learning

3.1 Read Data and Process Data

3.2 ML for Price

3.3 ML for Volume

4. Machine Learning on residuals

4.1 Computing Residuals

4.2 Computing percentage deviation from the expected value

1. Active Pricing Detection

1.1 Definition

In statistics, standardizing a covariate (e.g., taking its Z-score) helps detect when a variable of interest deviates from its Expected Value (EV). We want to detect when prices differ from their EV, and we call that Active Pricing (AP). When a decision maker is conducting pricing experiments or trying out strategies, they are actively pricing. It is important to detect the effect of AP on Demand (the outcome variable of interest).

1.2 Simplest Example: Stationary Price Trend

In an experiment of, say, 100 days, the simplest case is a constant mean price over time; any deviation from that constant mean can be captured by a Z-score and considered AP. The Figure below shows conventional avocado sales data in Chicago; there, for instance, pricing decisions above $1.4 could be considered AP.
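As a minimal illustration (not from the original article), assuming a pandas Series of prices with a roughly constant mean, AP could be flagged with a simple Z-score threshold:

# Minimal sketch (illustration only): flag Active Pricing via a Z-score threshold
import pandas as pd

def flag_active_pricing(prices, z_threshold=1.96):
    z = (prices - prices.mean()) / prices.std()  # Z-score of each price
    return z.abs() > z_threshold                 # True where the price deviates strongly from its EV

prices = pd.Series([1.00, 1.05, 0.98, 1.02, 1.45, 1.01])  # toy example
print(flag_active_pricing(prices))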

1.3 Pricing Patterns

Prices can also follow time-dependent patterns: non-stationary trends, seasonality, and cycles. Identifying Active vs. Passive Pricing behavior then becomes a little more difficult, but it is very possible with simple time series analysis. For example, de-trending (subtracting the fitted trend from the prices) is one recommended first step. Then, for seasonal patterns, a simple linear regression of the de-trended prices on one-hot-encoded seasons as features can do. Finally, for cyclical patterns, adding features like sin(time) and cos(time) to a simple regressor can work too, as sketched below.
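Here is a minimal sketch of those three steps (illustration only; the column names price and date are assumptions, not the article's dataset):

# Minimal sketch (illustration only): de-trend, then model seasonal and cyclical patterns
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def expected_price_over_time(df):
    # df is assumed to have a datetime 'date' column and a numeric 'price' column
    t = np.arange(len(df)).reshape(-1, 1)

    # 1) De-trending: fit a linear trend and subtract it from the prices
    trend_fit = LinearRegression().fit(t, df["price"])
    detrended = df["price"] - trend_fit.predict(t)

    # 2) Seasonality: one-hot-encoded quarters; 3) cycles: sin/cos of time (weekly data, yearly cycle assumed)
    features = pd.get_dummies(df["date"].dt.quarter, prefix="q").astype(float)
    features["sin_t"] = np.sin(2 * np.pi * t.ravel() / 52)
    features["cos_t"] = np.cos(2 * np.pi * t.ravel() / 52)
    season_cycle_fit = LinearRegression().fit(features, detrended)

    # Expected price = trend + seasonal/cyclical component; large residuals are candidate AP
    return pd.Series(trend_fit.predict(t) + season_cycle_fit.predict(features), index=df.index)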

1.4 Machine Learning

In more complex cases, the price is a function of many variables and is determined by complex patterns (e.g., inflation, competition, discount seasons such as Black Friday, etc.). These patterns can be captured by a machine learning algorithm, which can therefore be used to estimate an “Expected Value” for price.

2. Expected Demand

2.1 Expected Demand but Not as a Function of Price

Much like price, demand is influenced by many market variables other than price and exhibits complex patterns such as non-stationary trends, seasonality, and cycles. Machine Learning can also be used to estimate an Expected Value for demand, explicitly not as a function of price.

3. Machine Learning

3.1 Read Data and Process Data

# Source: https://www.kaggle.com/datasets/timmate/avocado-prices-2020?resource=download
# 1- Read Data
import pandas as pd

path = ''
filename = 'avocado-updated-2020.csv'
datafile = path + filename
data = pd.read_csv(datafile)


# 2- Feature Selection
print(data.columns)
selected_columns = ["date", "average_price", "total_volume", "type", "geography"]
data = data[selected_columns]


# 3- Data Preprocessing
print(list(set(data.geography.values)))  # see unique values of geography
c = data["geography"] == "Total U.S."
data = data[c]  # filter data for Total U.S. only

for col in data.columns:
    print(col, data[col].dtype)  # explore column dtypes

data["date"] = pd.to_datetime(data["date"])                   # change date column from str to datetime
data["total_volume"] = pd.to_numeric(data["total_volume"])    # change column type to numeric
data["average_price"] = pd.to_numeric(data["average_price"])  # change column type to numeric

data = data.sort_values("date")  # sort dataframe by date

data = data.join(pd.get_dummies(data["type"]))  # one-hot-encode the 'type' column

def add_continous_time(df):  # add a continuous time index
    df["time_continous"] = df.reset_index().index
    return df

data = data.groupby("type").apply(add_continous_time)  # continuous time index per type

drop_cols = ["type", "geography", "date", "organic"]  # columns no longer needed
data = data.drop(drop_cols, axis=1)


print(data)
average_price total_volume conventional time_continous
102 0.95 31324277.73 1 0
103 1.46 612910.15 0 0
210 1.01 29063542.75 1 1
211 1.42 669528.88 0 1
318 1.03 29043458.85 1 2
... ... ... ... ...
32824 1.45 1816438.84 0 303
32931 0.89 41214193.33 1 304
32932 1.44 1739237.47 0 304
33039 0.89 36303871.76 1 305
33040 1.47 1583056.27 0 305

[612 rows x 4 columns]
# plotting average price over time, colored by type
data.plot.scatter("time_continous", "average_price", c="conventional", colormap='viridis')

3.2 ML for Price

X_cols = ["time_continous", "conventional"]
y_col = "average_price"
X = data[X_cols]
y = data[y_col]

import numpy as np
from sklearn.model_selection import train_test_split
train_features, test_features, train_labels, test_labels = train_test_split(X, y, test_size=0.33, random_state=42)

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
rf = RandomForestRegressor(random_state=42)

from pprint import pprint
# Look at the parameters used by our current forest
print('Parameters currently in use:\n')
pprint(rf.get_params())

Parameters currently in use:

{'bootstrap': True,
'ccp_alpha': 0.0,
'criterion': 'squared_error',
'max_depth': None,
'max_features': 1.0,
'max_leaf_nodes': None,
'max_samples': None,
'min_impurity_decrease': 0.0,
'min_samples_leaf': 1,
'min_samples_split': 2,
'min_weight_fraction_leaf': 0.0,
'n_estimators': 100,
'n_jobs': None,
'oob_score': False,
'random_state': 42,
'verbose': 0,
'warm_start': False}


# Number of trees in the random forest
n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in a tree
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

pprint(random_grid)
{'bootstrap': [True, False],
'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None],
'max_features': ['auto', 'sqrt'],
'min_samples_leaf': [1, 2, 4],
'min_samples_split': [2, 5, 10],
'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}


# Use the random grid to search for the best hyperparameters
# First create the base model to tune
rf = RandomForestRegressor()
# Random search of parameters, using 3-fold cross validation,
# searching across 100 different combinations, and using all available cores
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid, n_iter=100, cv=3, verbose=2, random_state=42, n_jobs=-1)
# Fit the random search model
rf_random.fit(train_features, train_labels)

pprint(rf_random.best_params_)
{'n_estimators': 200,
'min_samples_split': 2,
'min_samples_leaf': 1,
'max_features': 'sqrt',
'max_depth': 50,
'bootstrap': True}

def evaluate(model, test_features, test_labels):
    predictions = model.predict(test_features)
    errors = abs(predictions - test_labels)
    mape = 100 * np.mean(errors / test_labels)
    accuracy = 100 - mape
    print('Model Performance')
    print('Average Error: {:0.4f}.'.format(np.mean(errors)))
    print('Accuracy = {:0.2f}%.'.format(accuracy))
    return accuracy

base_model = RandomForestRegressor(n_estimators=10, random_state=42)
base_model.fit(train_features, train_labels)
base_accuracy = evaluate(base_model, test_features, test_labels)

Model Performance
Average Error: 0.0458.
Accuracy = 96.46%.

best_random = rf_random.best_estimator_
random_accuracy = evaluate(best_random, test_features, test_labels)

Model Performance
Average Error: 0.0431.
Accuracy = 96.59%.

print('Improvement of {:0.2f}%.'.format(100 * (random_accuracy - base_accuracy) / base_accuracy))

Improvement of 0.14%.

The improvement from hyperparameter tuning is very small, so it is reasonable to just use the algorithm's default parameters. Also, the mean absolute percentage error is about 4%, which is low enough for our purposes.

data ["price_pred"] = base_model.predict (X)

data
average_price total_volume conventional time_continous price_pred
102 0.95 31324277.73 1 0 1.002
103 1.46 612910.15 0 0 1.448
210 1.01 29063542.75 1 1 1.002
211 1.42 669528.88 0 1 1.432
318 1.03 29043458.85 1 2 1.002
... ... ... ... ... ...
32824 1.45 1816438.84 0 303 1.416
32931 0.89 41214193.33 1 304 0.911
32932 1.44 1739237.47 0 304 1.416
33039 0.89 36303871.76 1 305 0.911
33040 1.47 1583056.27 0 305 1.416

[612 rows x 5 columns]

data.plot.scatter ("price_pred" , "average_price")

3.3 ML for Volume

The same ML pipeline used for price is applied to volume, achieving a mean absolute percentage error of about 8%. A minimal sketch of this step is shown below, followed by the resulting dataframe.
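The article does not show this code, so here is a minimal sketch of how the volume model could look (illustration only; it reuses the small RandomForestRegressor from above on the full data rather than repeating the tuning loop):

# Minimal sketch (illustration only): Expected Value of volume NOT as a function of price
vol_model = RandomForestRegressor(n_estimators=10, random_state=42)
vol_model.fit(data[X_cols], data["total_volume"])       # same features as the price model: time and type
data["volume_pred"] = vol_model.predict(data[X_cols])   # expected volume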

pprint(data)
average_price total_volume conventional time_continous price_pred \
102 0.95 31324277.73 1 0 1.002
103 1.46 612910.15 0 0 1.448
210 1.01 29063542.75 1 1 1.002
211 1.42 669528.88 0 1 1.432
318 1.03 29043458.85 1 2 1.002
... ... ... ... ... ...
32824 1.45 1816438.84 0 303 1.416
32931 0.89 41214193.33 1 304 0.911
32932 1.44 1739237.47 0 304 1.416
33039 0.89 36303871.76 1 305 0.911
33040 1.47 1583056.27 0 305 1.416

volume_pred
102 3.175348e+07
103 6.342549e+05
210 3.175348e+07
211 6.569024e+05
318 3.175348e+07
... ...
32824 1.934104e+06
32931 4.476763e+07
32932 1.934104e+06
33039 4.476763e+07
33040 1.934104e+06

4. Machine Learning on residuals

4.1 Computing Residuals

data ["price_residuals"]  = data ["average_price"] - data ["price_pred"]
data ["volume_residuals"] = data ["total_volume"] - data ["volume_pred"]
print (data)
average_price total_volume conventional time_continous price_pred \
102 0.95 31324277.73 1 0 1.002
103 1.46 612910.15 0 0 1.448
210 1.01 29063542.75 1 1 1.002
211 1.42 669528.88 0 1 1.432
318 1.03 29043458.85 1 2 1.002
... ... ... ... ... ...
32824 1.45 1816438.84 0 303 1.416
32931 0.89 41214193.33 1 304 0.911
32932 1.44 1739237.47 0 304 1.416
33039 0.89 36303871.76 1 305 0.911
33040 1.47 1583056.27 0 305 1.416

volume_pred price_residuals volume_residuals
102 3.175348e+07 -0.052 -429206.484
103 6.342549e+05 0.012 -21344.731
210 3.175348e+07 0.008 -2689941.464
211 6.569024e+05 -0.012 12626.507
318 3.175348e+07 0.028 -2710025.364
... ... ... ...
32824 1.934104e+06 0.034 -117665.430
32931 4.476763e+07 -0.021 -3553441.380
32932 1.934104e+06 0.024 -194866.800
33039 4.476763e+07 -0.021 -8463762.950
33040 1.934104e+06 0.054 -351048.000

4.2 Computing percentage deviation from the expected value

data ["price_pdfev"]  = data ["price_residuals"] / data ["price_pred"]
data ["volume_pdfev"] = data ["volume_residuals"] / data ["volume_pred"]
print (data)

average_price total_volume conventional time_continous price_pred \
102 0.95 31324277.73 1 0 1.002
103 1.46 612910.15 0 0 1.448
210 1.01 29063542.75 1 1 1.002
211 1.42 669528.88 0 1 1.432
318 1.03 29043458.85 1 2 1.002
... ... ... ... ... ...
32824 1.45 1816438.84 0 303 1.416
32931 0.89 41214193.33 1 304 0.911
32932 1.44 1739237.47 0 304 1.416
33039 0.89 36303871.76 1 305 0.911
33040 1.47 1583056.27 0 305 1.416

volume_pred price_residuals volume_residuals price_pdfev \
102 3.175348e+07 -0.052 -429206.484 -0.051896
103 6.342549e+05 0.012 -21344.731 0.008287
210 3.175348e+07 0.008 -2689941.464 0.007984
211 6.569024e+05 -0.012 12626.507 -0.008380
318 3.175348e+07 0.028 -2710025.364 0.027944
... ... ... ... ...
32824 1.934104e+06 0.034 -117665.430 0.024011
32931 4.476763e+07 -0.021 -3553441.380 -0.023052
32932 1.934104e+06 0.024 -194866.800 0.016949
33039 4.476763e+07 -0.021 -8463762.950 -0.023052
33040 1.934104e+06 0.054 -351048.000 0.038136

volume_pdfev
102 -0.013517
103 -0.033653
210 -0.084713
211 0.019221
318 -0.085346
... ...
32824 -0.060837
32931 -0.079375
32932 -0.100753
33039 -0.189060
33040 -0.181504

[612 rows x 10 columns]

data.plot.scatter ("price_pdfev" , "volume_pdfev")

--

Emad Ezzeldin, Sr. Data Scientist @ UnitedHealthGroup

Data Scientist with 5 years of experience and an MSc in Data Analytics from George Mason University. I enjoy experimenting with Data Science tools. emad.ezzeldin4@gmail.com