Discovering Interdependency Among Treatment Variables


Causal analysis of multiple, potentially correlated treatment variables requires a careful experiment setup. This article explores several methodologies for discovering dependencies among causal variables in real-life data and recommends the best one.

Outline

1- Data

2.1 — Method 1: Regular Correlation

2.2 — Method 2: Phi Correlation Matrix

2.3 — Method 3: Chi-Square Test of Independence

2.4 — Method 4: Recursive Classifier

3- Analysis

4- Conclusion

5- References

1- Data

The data comes from a real-life survey with 100K+ responses comparing Halloween candy. [1]

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
from mlxtend.plotting import plot_decision_regions
import seaborn as sns

from scipy.stats import chi2_contingency

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, balanced_accuracy_score



#1- reading data
#2- Exploring
#3- Columns
#4- adding "other_candy"
#5- normalizing winpercent
#6- treatment and outcome variables
#7- Boolean
#8- Histograms



#1- reading data
data = pd.read_csv("/content/candy-data.csv")
data

competitorname chocolate fruity caramel peanutyalmondy nougat crispedricewafer hard bar pluribus sugarpercent pricepercent winpercent
0 100 Grand 1 0 1 0 0 1 0 1 0 0.732 0.860 66.971725
1 3 Musketeers 1 0 0 0 1 0 0 1 0 0.604 0.511 67.602936
2 One dime 0 0 0 0 0 0 0 0 0 0.011 0.116 32.261086
3 One quarter 0 0 0 0 0 0 0 0 0 0.011 0.511 46.116505
4 Air Heads 0 1 0 0 0 0 0 0 0 0.906 0.511 52.341465
... ... ... ... ... ... ... ... ... ... ... ... ... ...
80 Twizzlers 0 1 0 0 0 0 0 0 0 0.220 0.116 45.466282
81 Warheads 0 1 0 0 0 0 1 0 0 0.093 0.116 39.011898
82 Welch's Fruit Snacks 0 1 0 0 0 0 0 0 1 0.313 0.313 44.375519
83 Werther's Original Caramel 0 0 1 0 0 0 1 0 0 0.186 0.267 41.904308
84 Whoppers 1 0 0 0 0 1 0 0 1 0.872 0.848 49.524113

85 rows × 13 columns


#2- Exploring
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85 entries, 0 to 84
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 competitorname 85 non-null object
1 chocolate 85 non-null int64
2 fruity 85 non-null int64
3 caramel 85 non-null int64
4 peanutyalmondy 85 non-null int64
5 nougat 85 non-null int64
6 crispedricewafer 85 non-null int64
7 hard 85 non-null int64
8 bar 85 non-null int64
9 pluribus 85 non-null int64
10 sugarpercent 85 non-null float64
11 pricepercent 85 non-null float64
12 winpercent 85 non-null float64
dtypes: float64(3), int64(9), object(1)
memory usage: 8.8+ KB


#3- Columns
data.columns
Index(['competitorname', 'chocolate', 'fruity', 'caramel', 'peanutyalmondy',
'nougat', 'crispedricewafer', 'hard', 'bar', 'pluribus', 'sugarpercent',
'pricepercent', 'winpercent'],
dtype='object')


#4- adding "other_candy"
c1 = data['chocolate'] == 0
c2 = data['fruity'] == 0
c = c1 & c2
data["other_candy"] = np.where(c, 1, 0)

#5- normalizing winpercent
data ["winpercent"] = data ["winpercent"] / 100


#6- treatment and outcome variables
decision_cols = ['chocolate', 'fruity', 'other_candy', 'caramel', 'peanutyalmondy',
                 'nougat', 'crispedricewafer', 'hard', 'bar', 'pluribus']
cont_cols = ['sugarpercent', 'winpercent', 'pricepercent']


#7- Boolean
data[decision_cols].nunique()
chocolate 2
fruity 2
other_candy 2
caramel 2
peanutyalmondy 2
nougat 2
crispedricewafer 2
hard 2
bar 2
pluribus 2
dtype: int64


#8- Histograms
data[decision_cols].hist()

2.1 — Method 1: Regular Correlation
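The article does not show the code for this method; a minimal sketch, assuming it is the standard Pearson correlation computed with pandas over the binary columns defined above:

# Method 1 (sketch): plain Pearson correlation between the binary treatment columns
corr_matrix = data[decision_cols].corr()  # Pearson by default

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm", cbar=True)
plt.title('Pearson Correlation Matrix for Binary Variables')
plt.show()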

2.2 — Method 2: Phi Correlation Matrix

def phi_coefficient(x, y):
    contingency_table = pd.crosstab(x, y)
    n_11 = contingency_table.at[1, 1]
    n_01 = contingency_table.at[0, 1]
    n_10 = contingency_table.at[1, 0]
    n_00 = contingency_table.at[0, 0]
    return (n_11 * n_00 - n_10 * n_01) / np.sqrt(
        (n_10 + n_11) * (n_01 + n_00) * (n_00 + n_10) * (n_01 + n_11))

# Create an empty DataFrame to store the Phi coefficients
phi_matrix = pd.DataFrame(index=decision_cols, columns=decision_cols)

# Fill the matrix with Phi coefficients
for col1 in decision_cols:
    for col2 in decision_cols:
        if col1 != col2:
            phi_matrix.loc[col1, col2] = phi_coefficient(data[col1], data[col2])
        else:
            phi_matrix.loc[col1, col2] = 1  # The correlation of a variable with itself is 1

# Convert entries to numeric for proper visualization
phi_matrix = phi_matrix.astype(float)

# Visualize the Phi correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(phi_matrix, annot=True, fmt=".2f", cmap="coolwarm", cbar=True)
plt.title('Phi Correlation Matrix for Binary Variables')
plt.show()
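Because the phi coefficient of two 0/1 variables is mathematically the same as their Pearson correlation, a quick sanity check (a sketch, reusing phi_matrix from above) is that the two methods agree:

# The phi matrix should match the plain Pearson correlation matrix
pearson_matrix = data[decision_cols].corr()
print(np.allclose(phi_matrix.values, pearson_matrix.values))  # expect True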

2.3 — Method 3: Chi-Square Test of Independence

from scipy.stats import chi2_contingency

# Function to calculate Chi-square test for every pair of binary variables
def chi_square_matrix(df):
    cols = df.columns
    n = len(cols)
    p_values = pd.DataFrame(np.zeros((n, n)), columns=cols, index=cols)

    for col1 in cols:
        for col2 in cols:
            if col1 != col2:
                contingency_table = pd.crosstab(df[col1], df[col2])
                _, p, _, _ = chi2_contingency(contingency_table)
                p_values.loc[col1, col2] = p
            else:
                p_values.loc[col1, col2] = np.nan  # Diagonal values are not applicable
    return p_values

# Calculate and print the matrix of p-values
p_values_matrix = chi_square_matrix(data[decision_cols])
print(p_values_matrix)


# Visualize the p-values matrix from the Chi-Square test
plt.figure(figsize=(10, 8))
sns.heatmap(p_values_matrix, annot=True, fmt=".2f", cmap="viridis", cbar=True, mask=np.isnan(p_values_matrix))
plt.title('P-Values from Chi-Square Test of Independence')
plt.show()
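To turn the heatmap into a concrete list, a small helper (a sketch; the 0.05 threshold is an assumption, not from the article) can extract the pairs the test flags as dependent:

# List variable pairs with p-value below a chosen significance level
alpha = 0.05  # assumed threshold, not specified in the article
dependent_pairs = [
    (c1, c2)
    for i, c1 in enumerate(p_values_matrix.columns)
    for c2 in p_values_matrix.columns[i + 1:]
    if p_values_matrix.loc[c1, c2] < alpha
]
print(dependent_pairs)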

2.4 — Method 4: Recursive Classifier

reordered_decision_cols = ['chocolate', 'fruity', 'other_candy', 'bar', 'peanutyalmondy', 'hard',
                           'crispedricewafer', 'pluribus', 'nougat', 'caramel']

# Store models and their accuracies
models = []
accuracies = []
feature_imp = []
feature_imp_dfs = []
for i in range(len(reordered_decision_cols)):
    # Use the i-th column as the target, and all others as features
    X = data[reordered_decision_cols].drop(reordered_decision_cols[i], axis=1)
    y = data[reordered_decision_cols].iloc[:, i]

    # Initialize and train the RandomForest classifier
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X, y)

    # Store the model
    models.append(model)

    # Evaluate and store the accuracy (in-sample, on the training data)
    y_pred = model.predict(X)
    accuracy = balanced_accuracy_score(y, y_pred, adjusted=True)
    accuracies.append(accuracy)
    print(f"Model {i+1} (predicting variable {i+1}): Accuracy = {accuracy:.2f}", reordered_decision_cols[i])

    feature_imp.append(model.feature_importances_)

    print(X.columns)
    feature_imp_dfs.append(pd.DataFrame([feature_imp[i]], columns=X.columns))

Model 1 (predicting variable 1): Accuracy = 0.97 chocolate
Index(['fruity', 'other_candy', 'bar', 'peanutyalmondy', 'hard',
'crispedricewafer', 'pluribus', 'nougat', 'caramel'],
dtype='object')
Model 2 (predicting variable 2): Accuracy = 1.00 fruity
Index(['chocolate', 'other_candy', 'bar', 'peanutyalmondy', 'hard',
'crispedricewafer', 'pluribus', 'nougat', 'caramel'],
dtype='object')
Model 3 (predicting variable 3): Accuracy = 1.00 other_candy
Index(['chocolate', 'fruity', 'bar', 'peanutyalmondy', 'hard',
'crispedricewafer', 'pluribus', 'nougat', 'caramel'],
dtype='object')
Model 4 (predicting variable 4): Accuracy = 0.84 bar
Index(['chocolate', 'fruity', 'other_candy', 'peanutyalmondy', 'hard',
'crispedricewafer', 'pluribus', 'nougat', 'caramel'],
dtype='object')
Model 5 (predicting variable 5): Accuracy = 0.27 peanutyalmondy
Index(['chocolate', 'fruity', 'other_candy', 'bar', 'hard', 'crispedricewafer',
'pluribus', 'nougat', 'caramel'],
dtype='object')
Model 6 (predicting variable 6): Accuracy = 0.07 hard
Index(['chocolate', 'fruity', 'other_candy', 'bar', 'peanutyalmondy',
'crispedricewafer', 'pluribus', 'nougat', 'caramel'],
dtype='object')
Model 7 (predicting variable 7): Accuracy = 0.42 crispedricewafer
Index(['chocolate', 'fruity', 'other_candy', 'bar', 'peanutyalmondy', 'hard',
'pluribus', 'nougat', 'caramel'],
dtype='object')
Model 8 (predicting variable 8): Accuracy = 0.59 pluribus
Index(['chocolate', 'fruity', 'other_candy', 'bar', 'peanutyalmondy', 'hard',
'crispedricewafer', 'nougat', 'caramel'],
dtype='object')
Model 9 (predicting variable 9): Accuracy = 0.70 nougat
Index(['chocolate', 'fruity', 'other_candy', 'bar', 'peanutyalmondy', 'hard',
'crispedricewafer', 'pluribus', 'caramel'],
dtype='object')
Model 10 (predicting variable 10): Accuracy = 0.40 caramel
Index(['chocolate', 'fruity', 'other_candy', 'bar', 'peanutyalmondy', 'hard',
'crispedricewafer', 'pluribus', 'nougat'],
dtype='object')
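Note that these accuracies are computed on the same rows the models were trained on, so they are in-sample scores. A sketch of an out-of-sample alternative (not in the article) using 5-fold cross-validation:

# Cross-validated adjusted balanced accuracy for each target variable
from sklearn.model_selection import cross_val_score

scorer = make_scorer(balanced_accuracy_score, adjusted=True)
for col in reordered_decision_cols:
    X = data[reordered_decision_cols].drop(col, axis=1)
    y = data[col]
    cv_scores = cross_val_score(
        RandomForestClassifier(n_estimators=100, random_state=42),
        X, y, cv=5, scoring=scorer)
    print(f"{col}: mean CV adjusted balanced accuracy = {cv_scores.mean():.2f}")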

imp_df = pd.concat(feature_imp_dfs)
imp_df
fruity other_candy bar peanutyalmondy hard crispedricewafer pluribus nougat caramel chocolate
0 0.440885 0.238638 0.155794 0.049770 0.035924 0.014632 0.033634 0.010930 0.019794 NaN
0 NaN 0.246309 0.111903 0.057061 0.060045 0.005344 0.016599 0.006288 0.045781 0.450670
0 0.358468 NaN 0.029319 0.037395 0.024337 0.005462 0.022663 0.011999 0.050853 0.459503
0 0.112288 0.029041 NaN 0.056056 0.025493 0.070581 0.328249 0.148535 0.032595 0.197160
0 0.169825 0.056058 0.131078 NaN 0.026864 0.095765 0.116523 0.121722 0.134123 0.148043
0 0.329435 0.053421 0.048221 0.040072 NaN 0.000691 0.178395 0.003601 0.107733 0.238431
0 0.046013 0.015362 0.221445 0.151869 0.002875 NaN 0.042659 0.260858 0.169054 0.089865
0 0.080873 0.035957 0.475241 0.068302 0.066090 0.018932 NaN 0.044231 0.099111 0.111263
0 0.022232 0.065926 0.269594 0.111551 0.003564 0.209836 0.056564 NaN 0.196139 0.064594
0 0.105659 0.086848 0.112568 0.139245 0.073982 0.118082 0.122589 0.175921 NaN 0.065107

imp_df ["Decisions"] = reordered_decision_cols
imp_df ["Accuracies"] = accuracies
imp_df
fruity other_candy bar peanutyalmondy hard crispedricewafer pluribus nougat caramel chocolate Decisions Accuracies
0 0.440885 0.238638 0.155794 0.049770 0.035924 0.014632 0.033634 0.010930 0.019794 NaN chocolate 0.972973
0 NaN 0.246309 0.111903 0.057061 0.060045 0.005344 0.016599 0.006288 0.045781 0.450670 fruity 1.000000
0 0.358468 NaN 0.029319 0.037395 0.024337 0.005462 0.022663 0.011999 0.050853 0.459503 other_candy 1.000000
0 0.112288 0.029041 NaN 0.056056 0.025493 0.070581 0.328249 0.148535 0.032595 0.197160 bar 0.841518
0 0.169825 0.056058 0.131078 NaN 0.026864 0.095765 0.116523 0.121722 0.134123 0.148043 peanutyalmondy 0.271630
0 0.329435 0.053421 0.048221 0.040072 NaN 0.000691 0.178395 0.003601 0.107733 0.238431 hard 0.066667
0 0.046013 0.015362 0.221445 0.151869 0.002875 NaN 0.042659 0.260858 0.169054 0.089865 crispedricewafer 0.415751
0 0.080873 0.035957 0.475241 0.068302 0.066090 0.018932 NaN 0.044231 0.099111 0.111263 pluribus 0.592018
0 0.022232 0.065926 0.269594 0.111551 0.003564 0.209836 0.056564 NaN 0.196139 0.064594 nougat 0.701465
0 0.105659 0.086848 0.112568 0.139245 0.073982 0.118082 0.122589 0.175921 NaN 0.065107 caramel 0.400402

imp_df = imp_df[reordered_decision_cols]
imp_df["main_cols_mean"] = np.mean(imp_df[reordered_decision_cols].iloc[:, :3], axis=1)
imp_df["other_cols_mean"] = np.mean(imp_df[reordered_decision_cols].iloc[:, 3:], axis=1)
imp_df["ratio"] = imp_df["main_cols_mean"] / imp_df["other_cols_mean"]
imp_df["Accuracies"] = accuracies
imp_df["Decisions"] = reordered_decision_cols
imp_df.sort_values("Accuracies")

chocolate fruity other_candy bar peanutyalmondy hard crispedricewafer pluribus nougat caramel main_cols_mean other_cols_mean ratio Accuracies Decisions
0 0.238431 0.329435 0.053421 0.048221 0.040072 NaN 0.000691 0.178395 0.003601 0.107733 0.207096 0.063119 3.281045 0.066667 hard
0 0.148043 0.169825 0.056058 0.131078 NaN 0.026864 0.095765 0.116523 0.121722 0.134123 0.124642 0.104346 1.194507 0.271630 peanutyalmondy
0 0.065107 0.105659 0.086848 0.112568 0.139245 0.073982 0.118082 0.122589 0.175921 NaN 0.085871 0.123731 0.694015 0.400402 caramel
0 0.089865 0.046013 0.015362 0.221445 0.151869 0.002875 NaN 0.042659 0.260858 0.169054 0.050413 0.141460 0.356380 0.415751 crispedricewafer
0 0.111263 0.080873 0.035957 0.475241 0.068302 0.066090 0.018932 NaN 0.044231 0.099111 0.076031 0.128651 0.590985 0.592018 pluribus
0 0.064594 0.022232 0.065926 0.269594 0.111551 0.003564 0.209836 0.056564 NaN 0.196139 0.050917 0.141208 0.360584 0.701465 nougat
0 0.197160 0.112288 0.029041 NaN 0.056056 0.025493 0.070581 0.328249 0.148535 0.032595 0.112830 0.110252 1.023385 0.841518 bar
0 NaN 0.440885 0.238638 0.155794 0.049770 0.035924 0.014632 0.033634 0.010930 0.019794 0.339762 0.045782 7.421221 0.972973 chocolate
0 0.450670 NaN 0.246309 0.111903 0.057061 0.060045 0.005344 0.016599 0.006288 0.045781 0.348489 0.043289 8.050342 1.000000 fruity
0 0.459503 0.358468 NaN 0.029319 0.037395 0.024337 0.005462 0.022663 0.011999 0.050853 0.408986 0.026004 15.727806 1.000000 other_candy

3- Analysis

There are three major kinds of candy: chocolate, fruity, and non-chocolate/non-fruity ("other"). During data pre-processing, "other_candy" was added as a treatment variable for this third kind. On top of each candy type there are many possible additions/treatments, such as peanutyalmondy, caramel, crispedricewafer, etc.

Methodology 2.1 — Regular correlation shows that chocolate and fruity are mutually exclusive, which makes sense, and that there is a correlation between chocolate and bar. other_candy should also have a strong negative correlation with both chocolate and fruity, because it is mutually exclusive with them, but that does not show (probably because of the small sample size of the "other" group; see the quick check below). There is a negative correlation between bar and pluribus, which also makes sense.
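A quick way to verify the small-sample explanation (a sketch, reusing data as loaded above) is to count the group sizes:

# Count how many candies fall into each mutually exclusive type
print(data[['chocolate', 'fruity', 'other_candy']].sum())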

Methodology 2.2 — The Phi correlation matrix shows exactly the same results as methodology 2.1. This is expected: for two binary 0/1 variables, the phi coefficient is mathematically identical to the Pearson correlation.

Methodology 2.3 — The chi-square test of independence yields p-values near 0 for dependent pairs (whether they co-occur or exclude each other) and large p-values where there is no evidence of dependence. It clearly shows the logical relationship between other_candy and additions that usually occur on non-chocolate Halloween candy, such as nougat and peanutyalmondy. It also flags caramel with peanutyalmondy, peanutyalmondy with crispedricewafer, nougat with crispedricewafer, and hard with pluribus.

Methodology 2.4 — Recursive classifier: the maximum accuracy of 1 for fruity, chocolate, and other_candy makes perfect sense, because they are mutually exclusive. Bar is also fairly easy to predict from the other treatment variables, at 84% accuracy, followed by nougat at 0.70. The rest are not easy to predict (below 70%). Note that with adjusted=True, balanced accuracy is rescaled so that chance-level prediction scores 0 and perfect prediction scores 1 (rather than 0.5 and 1 for the unadjusted version); see the short demonstration after the table below.

For reference, here are the rows for nougat and bar, the two most predictable non-mutually-exclusive treatments:

chocolate fruity other_candy bar peanutyalmondy hard crispedricewafer pluribus nougat caramel main_cols_mean other_cols_mean ratio Accuracies Decisions
0 0.064594 0.022232 0.065926 0.269594 0.111551 0.003564 0.209836 0.056564 NaN 0.196139 0.050917 0.141208 0.360584 0.701465 nougat
0 0.197160 0.112288 0.029041 NaN 0.056056 0.025493 0.070581 0.328249 0.148535 0.032595 0.112830 0.110252 1.023385 0.841518 bar
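To make the adjusted scoring concrete, a tiny demonstration (a sketch, not from the article): with adjusted=True, scikit-learn rescales balanced accuracy so chance-level predictions score around 0 and perfect predictions score 1.

# Chance-level vs. perfect predictions under adjusted balanced accuracy
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=10_000)
y_rand = rng.integers(0, 2, size=10_000)
print(balanced_accuracy_score(y_true, y_rand, adjusted=True))  # close to 0.0
print(balanced_accuracy_score(y_true, y_true, adjusted=True))  # exactly 1.0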

4- Conclusion

Methodologies 2.1 and 2.2 are similar, and they show very little correlation among the non-mutually-exclusive treatment variables. Method 2.3, by contrast, shows many correlations, while method 2.4 shows little predictive power except for bar and nougat. All in all, methods 2.1 and 2.2 do not say much, 2.3 over-inflates correlation/interdependence, and 2.4 is the most reliable. Thus, when assessing the effect of the treatment variables on outcomes such as winpercent, sugarpercent, and pricepercent, remove bar and consider removing nougat; the rest of the treatment variables can be kept.
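In code, the recommended reduced treatment set would look like this (a sketch; final_treatments is a name introduced here, not from the article):

# Drop 'bar' (and, more cautiously, 'nougat') before estimating treatment effects
final_treatments = [c for c in decision_cols if c not in ('bar', 'nougat')]
print(final_treatments)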

5- References

[1] FiveThirtyEight, "The Ultimate Halloween Candy Power Ranking," Kaggle. https://www.kaggle.com/datasets/fivethirtyeight/the-ultimate-halloween-candy-power-ranking

Written by Emad Ezzeldin, Sr. Data Scientist @ UnitedHealthGroup

Five years as a data scientist, with an MSc in Data Analytics from George Mason University. I enjoy experimenting with data science tools. emad.ezzeldin4@gmail.com
