How to generate a synthetic DataFrame with Feature Association on the Fly?

Emad Ezzeldin, Sr. DataScientist@UnitedHealthGroup

2 min read · Jan 25, 2024

Studying the capacity and capabilities of machine learning algorithms requires testing them on synthetic datasets with complex structure. To study whether an ML algorithm can learn Feature Association, I recommend the toy dataset generated below.

The code below generates a DataFrame with 40 rows, 8 feature columns, and one target column, where:

x1, x2, x3, x4 are 4 random variables (in the code below, random integers from 1 to 100).

z1, z2, z3, z4 are another 4 random variables, drawn the same way.

x_array : [x1,x2,x3,x4]

z_array : [z1,z2,z3,z4]

and

f(x,z) = ProductSum(x_array, z_array)

f(x,z) = x1*z1 + x2*z2 + x3*z3 + x4*z4
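As a quick sanity check of the formula, here is a minimal sketch that computes the product sum by hand for a single row. The input values are taken from row 0 of the sample output shown later in this article; the variable names are only illustrative.

```python
# Values copied from row 0 of the sample output in this article
x_array = [17, 62, 35, 35]
z_array = [95, 59, 97, 86]

# f(x,z) = x1*z1 + x2*z2 + x3*z3 + x4*z4
f_xz = sum(x_i * z_i for x_i, z_i in zip(x_array, z_array))

print(f_xz)  # 11678 = 1615 + 3658 + 3395 + 3010
```

The result matches the f(x,z) value listed for that row.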

# Import the required libraries
import pandas as pd
import numpy as np
import random

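def generate_random_int_dataframe(rows, cols):
    """Return a DataFrame with `cols` x-columns, `cols` z-columns, and a
    target column f(x,z) equal to the sum of x_i * z_i for each row."""

    # Maps each column name to the list of values in that column
    data = {}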

    # Generate data for each row: random integers from 1 to 100 (inclusive)
    for row in range(rows):
        x = [random.randint(1, 100) for _ in range(cols)]
        z = [random.randint(1, 100) for _ in range(cols)]

        # Compute f(x, z) = sum of x_i * z_i for this row
        f_xz = sum(x_i * z_i for x_i, z_i in zip(x, z))

        # Update data dictionary
        for col in range(cols):
            data.setdefault(f'x{col+1}', []).append(x[col])
            data.setdefault(f'z{col+1}', []).append(z[col])
        
        data.setdefault('f(x,z)', []).append(f_xz)

    # Create a DataFrame
    df = pd.DataFrame(data)
    return df

# Example usage with 40 rows and 4 elements in each array
df_random_int_rows = generate_random_int_dataframe(40, 4)
df_random_int_rows

The first four rows of one run (the values differ on every run because no random seed is set):

index,x1,z1,x2,z2,x3,z3,x4,z4,"f(x,z)"
0,17,95,62,59,35,97,35,86,11678
1,70,2,9,51,76,76,26,63,8013
2,70,58,81,26,16,1,9,29,6443
3,74,42,21,95,56,15,72,42,8967
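If you need the same dataset on every run, one option is a seeded, vectorized variant. The sketch below is an assumption on my part rather than the code used to produce the table above: the function name `generate_product_sum_dataframe` and the `seed` parameter are hypothetical, and it swaps Python's `random` module for NumPy's `default_rng`, so its values will not match those shown above.

```python
import numpy as np
import pandas as pd

def generate_product_sum_dataframe(rows, cols, seed=None):
    """Hypothetical seeded variant: same column layout, NumPy-based sampling."""
    rng = np.random.default_rng(seed)

    # Random integers from 1 to 100 inclusive (the upper bound 101 is exclusive)
    x = rng.integers(1, 101, size=(rows, cols))
    z = rng.integers(1, 101, size=(rows, cols))

    # Interleave the columns as x1, z1, x2, z2, ... to match the original layout
    data = {}
    for col in range(cols):
        data[f'x{col+1}'] = x[:, col]
        data[f'z{col+1}'] = z[:, col]

    # Target: row-wise sum of element-wise products
    data['f(x,z)'] = (x * z).sum(axis=1)
    return pd.DataFrame(data)

df_seeded = generate_product_sum_dataframe(40, 4, seed=0)

# Recompute the target from the feature columns to confirm the association
recomputed = sum(df_seeded[f'x{i}'] * df_seeded[f'z{i}'] for i in range(1, 5))
print((recomputed == df_seeded['f(x,z)']).all())  # True
```

Calling the function twice with the same `seed` returns identical DataFrames, which makes experiments on this toy dataset repeatable.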
