How to generate a synthetic DataFrame with Feature Association on the Fly?
Studying the capacity and capabilities of Machine Learning algorithms requires testing them on synthetic datasets with a known, non-trivial structure. To study whether an ML algorithm can simulate an understanding of Feature Association, I recommend the toy dataset generated below.
The code below generates a DataFrame with 40 rows, 8 feature columns and one target column, where:
x1, x2, x3, x4 are 4 random variables (integers drawn uniformly from 1 to 100).
z1, z2, z3, z4 are another 4 random variables drawn the same way.
x_array = [x1, x2, x3, x4]
z_array = [z1, z2, z3, z4]
and the target column is
f(x,z) = ProductSum(x_array, z_array)
f(x,z) = x1*z1 + x2*z2 + x3*z3 + x4*z4
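As a quick illustration of the product sum defined above, here is a minimal sketch on made-up values. The helper name `product_sum` and the sample numbers are assumptions chosen for illustration and are not part of the generator itself.

```python
def product_sum(x_array, z_array):
    """Return x1*z1 + x2*z2 + ... for two equal-length sequences."""
    return sum(x * z for x, z in zip(x_array, z_array))

# Illustrative values only (not taken from the generated dataset)
x_array = [1, 2, 3, 4]
z_array = [5, 6, 7, 8]

result = product_sum(x_array, z_array)
print(result)  # 1*5 + 2*6 + 3*7 + 4*8 = 70
```

Each x is paired with the z that has the same index, so a model can only predict the target well if it learns which features belong together.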
# Import the required libraries
import pandas as pd
import numpy as np
import random
def generate_random_int_dataframe(rows, cols):
    data = {}
    # Generate data for each row
    for row in range(rows):
        x = [random.randint(1, 100) for _ in range(cols)]
        z = [random.randint(1, 100) for _ in range(cols)]
        # Compute f(x, z) = sum of x_i * z_i for this row
        f_xz = sum([x[i] * z[i] for i in range(cols)])
        # Update data dictionary (columns are interleaved: x1, z1, x2, z2, ...)
        for col in range(cols):
            data.setdefault(f'x{col+1}', []).append(x[col])
            data.setdefault(f'z{col+1}', []).append(z[col])
        data.setdefault('f(x,z)', []).append(f_xz)
    # Create a DataFrame
    df = pd.DataFrame(data)
    return df
# Example usage with 40 rows and 4 elements in each array
df_random_int_rows = generate_random_int_dataframe(40, 4)
df_random_int_rows
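For larger datasets, the same table can be built without the per-row Python loop. The following is a sketch of one possible vectorized variant, not the original method: the function name `generate_vectorized_dataframe` and the `seed` parameter are assumptions added here so that runs can be reproduced. It uses NumPy's `default_rng`, whose `integers` upper bound is exclusive, hence `101` to match the 1 to 100 range above.

```python
import numpy as np
import pandas as pd

def generate_vectorized_dataframe(rows, cols, seed=None):
    """Hypothetical vectorized variant of the generator above."""
    rng = np.random.default_rng(seed)
    # Integers in [1, 100]; the upper bound of rng.integers is exclusive
    x = rng.integers(1, 101, size=(rows, cols))
    z = rng.integers(1, 101, size=(rows, cols))

    data = {}
    # Keep the same interleaved column order: x1, z1, x2, z2, ...
    for col in range(cols):
        data[f'x{col+1}'] = x[:, col]
        data[f'z{col+1}'] = z[:, col]
    # Row-wise product sum: x1*z1 + x2*z2 + ...
    data['f(x,z)'] = (x * z).sum(axis=1)
    return pd.DataFrame(data)

df_vec = generate_vectorized_dataframe(40, 4, seed=0)
print(df_vec.shape)  # (40, 9)
```

Passing the same `seed` returns the same DataFrame, which is convenient when comparing several models on identical data.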
Sample output (first four rows; the values differ on every run because the data is random):
index,x1,z1,x2,z2,x3,z3,x4,z4,"f(x,z)"
0,17,95,62,59,35,97,35,86,11678
1,70,2,9,51,76,76,26,63,8013
2,70,58,81,26,16,1,9,29,6443
3,74,42,21,95,56,15,72,42,8967
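As a sanity check, the rows shown above can be verified against the definition of f(x,z). This small sketch copies the four sample rows by hand (without the index column) and recomputes the product sum for each; the variable names are illustrative.

```python
# Sample rows copied from the output above: x1, z1, x2, z2, x3, z3, x4, z4, f(x,z)
sample_rows = [
    (17, 95, 62, 59, 35, 97, 35, 86, 11678),
    (70, 2, 9, 51, 76, 76, 26, 63, 8013),
    (70, 58, 81, 26, 16, 1, 9, 29, 6443),
    (74, 42, 21, 95, 56, 15, 72, 42, 8967),
]

recomputed = []
for *features, target in sample_rows:
    xs = features[0::2]  # x1, x2, x3, x4
    zs = features[1::2]  # z1, z2, z3, z4
    recomputed.append(sum(x * z for x, z in zip(xs, zs)))

# True for each row whose target equals the recomputed product sum
matches = [r == row[-1] for r, row in zip(recomputed, sample_rows)]
print(matches)  # [True, True, True, True]
```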