How to generate a synthetic DataFrame with Feature Association on the Fly?


Studying what Machine Learning algorithms can and cannot learn often calls for synthetic datasets with a known structure. To test whether an ML algorithm can capture Feature Association, where the target depends on products of paired features rather than on any single feature, I recommend the toy dataset generated below.

The code below generates a DataFrame with 40 rows, 8 feature columns, and one target column,

where:

x1, x2, x3, x4 are four random integers, each drawn uniformly from 1 to 100.

z1, z2, z3, z4 are another four random integers drawn the same way.

x_array: [x1, x2, x3, x4]

z_array: [z1, z2, z3, z4]

and

f(x,z) = ProductSum(x_array, z_array)

f(x,z) = x1*z1 + x2*z2 + x3*z3 + x4*z4
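To make the formula concrete, here is a minimal sketch that evaluates it for a single row. The helper name `product_sum` is my own (hypothetical, chosen to mirror ProductSum above), and the input values are taken from row 0 of the sample output shown later in this article.

```python
def product_sum(x_array, z_array):
    # Multiply the arrays element by element, then add up the products
    return sum(x * z for x, z in zip(x_array, z_array))

# Values from row 0 of the sample output: x1..x4 and z1..z4
x_array = [17, 62, 35, 35]
z_array = [95, 59, 97, 86]

f_xz = product_sum(x_array, z_array)
print(f_xz)  # 17*95 + 62*59 + 35*97 + 35*86 = 11678
```

The result matches the f(x,z) value listed for that row.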

# Import the libraries used to build the dataset
import pandas as pd
import numpy as np
import random

def generate_random_int_dataframe(rows, cols):
    """Return a DataFrame with `cols` (x, z) column pairs plus the target f(x,z)."""
    data = {}

    # Generate data for each row
    for row in range(rows):
        # random.randint includes both endpoints, so values range from 1 to 100
        x = [random.randint(1, 100) for _ in range(cols)]
        z = [random.randint(1, 100) for _ in range(cols)]

        # Compute f(x, z) = sum of x_i * z_i for this row
        f_xz = sum(x[i] * z[i] for i in range(cols))

        # Append this row's values to the column lists
        for col in range(cols):
            data.setdefault(f'x{col+1}', []).append(x[col])
            data.setdefault(f'z{col+1}', []).append(z[col])

        data.setdefault('f(x,z)', []).append(f_xz)

    # Create a DataFrame
    df = pd.DataFrame(data)
    return df

# Example usage with 40 rows and 4 elements in each array
df_random_int_rows = generate_random_int_dataframe(40, 4)
df_random_int_rows
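If you need a reproducible dataset or a much larger one, the same table can be built with NumPy arrays instead of a Python loop. This is only a sketch of one alternative, not the code used above: the function name `generate_random_int_dataframe_np` and the `seed` parameter are my own assumptions, and because it uses a different random generator it will not reproduce the exact numbers of the loop version.

```python
import numpy as np
import pandas as pd

def generate_random_int_dataframe_np(rows, cols, seed=None):
    # Hypothetical vectorized variant; `seed` makes the output repeatable
    rng = np.random.default_rng(seed)

    # The upper bound of rng.integers is exclusive, so 101 yields values 1..100
    x = rng.integers(1, 101, size=(rows, cols))
    z = rng.integers(1, 101, size=(rows, cols))

    # Interleave the columns as x1, z1, x2, z2, ... to match the layout above
    data = {}
    for col in range(cols):
        data[f'x{col+1}'] = x[:, col]
        data[f'z{col+1}'] = z[:, col]

    # Row-wise product sum: f(x,z) = sum over i of x_i * z_i
    data['f(x,z)'] = (x * z).sum(axis=1)
    return pd.DataFrame(data)

df_np = generate_random_int_dataframe_np(40, 4, seed=0)
print(df_np.shape)
```

Calling the function twice with the same `seed` returns identical DataFrames, which helps when comparing several models on the same data.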

index,x1,z1,x2,z2,x3,z3,x4,z4,"f(x,z)"
0,17,95,62,59,35,97,35,86,11678
1,70,2,9,51,76,76,26,63,8013
2,70,58,81,26,16,1,9,29,6443
3,74,42,21,95,56,15,72,42,8967
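As a quick sanity check, the target column can be recomputed from the feature columns. The sketch below retypes the four sample rows above by hand into a DataFrame (the variable names `sample` and `recomputed` are my own) and confirms that every f(x,z) value equals the product sum of its row.

```python
import pandas as pd

# The four sample rows shown above, entered column by column
sample = pd.DataFrame({
    'x1': [17, 70, 70, 74], 'z1': [95, 2, 58, 42],
    'x2': [62, 9, 81, 21],  'z2': [59, 51, 26, 95],
    'x3': [35, 76, 16, 56], 'z3': [97, 76, 1, 15],
    'x4': [35, 26, 9, 72],  'z4': [86, 63, 29, 42],
    'f(x,z)': [11678, 8013, 6443, 8967],
})

# Recompute x1*z1 + x2*z2 + x3*z3 + x4*z4 for all rows at once
recomputed = sum(sample[f'x{i}'] * sample[f'z{i}'] for i in range(1, 5))

matches = bool((recomputed == sample['f(x,z)']).all())
print(matches)  # True
```

The same check works on the full 40-row DataFrame by swapping `sample` for `df_random_int_rows`.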

Emad Ezzeldin ,Sr. DataScientist@UnitedHealthGroup

Written by Emad Ezzeldin ,Sr. DataScientist@UnitedHealthGroup

Data Scientist with 5 years of experience and an MSc in Data Analytics from George Mason University. I enjoy experimenting with Data Science tools. emad.ezzeldin4@gmail.com
