Debugging Computer Vision Classification: Removing Noise Images


Introduction

This post follows up on my previous one [1] about debugging Computer Vision image classification models. The code snippets here are the main modifications to the code described in [1] and [2]. Every image classification model produces misclassifications, both False Positives and False Negatives. To judge its performance objectively, however, noisy images must be removed first. Some noise is expected in industrial settings, but it is important to understand its effect on model performance, so noisy images should be identified and removed. There are different categories of noise, such as those listed in [3] under Data Challenges. What these categories have in common is that the model bases its predictions on the environment rather than on the animal, creature, or object.

I propose the following methodology for extracting noisy images:

1- Insert noisy images by manually corrupting some of the images in the original data, as shown in Figure 1 below.

2- Generate clusters of images based on the Feature Map extracted during prediction, as described in [1] and [4] and implemented in [2].

3- Extract the cluster with the highest ratio of manually corrupted to original images. The original images in that cluster are the noisy image samples in your data, as shown in Figure 2 below.

I implemented this methodology on the 8GB iNaturalist data set built into PyTorch [5], with 50K images. I extracted a random 500 images and manually corrupted another 500 (blurred copies of the sampled images, as described below). Predictions were generated with a pre-trained VGG16 model in TensorFlow, and the Feature Map with its attributions was extracted by removing the last output layer, as discussed in [1], [4] and [3]. PCA and K-means clustering were then performed with scikit-learn. K-means generated 4 clusters, and one cluster had the highest ratio of corrupted to original images, roughly 7 to 1 (436 of the 500 corrupted images and 66 of the 500 original images). The final result was 66 noisy images extracted (see a sample in Figure 2 below). The images below, found upon inspection of that cluster, are good demonstrations.

One of the extracted images where no creature/animal was found.
Another noisy image, with the creature camouflaged in its environment.

[1] https://medium.com/me/stats/post/c2a276409bf8

[2] https://github.com/Eezzeldin/ImageClustering.git

[3] https://github.com/visipedia/iwildcam_comp

[4] https://towardsdatascience.com/how-to-cluster-images-based-on-visual-similarity-cd6e7209fe34

[5] https://pytorch.org/vision/0.12/generated/torchvision.datasets.INaturalist.html

Data

import os
from torchvision.datasets import INaturalist

# download the iNaturalist 2021 validation split
# https://github.com/visipedia/inat_comp/tree/master/2021
INaturalist(root="/gdrive/My Drive/Colab Notebooks/ComputerVision/ImageClustering/animals",
            version='2021_valid',
            target_type=['full'],
            download=True)

img_path = "/gdrive/My Drive/Colab Notebooks/ComputerVision/ImageClustering/animals/2021_valid/"
# keep only the directories that belong to the animal kingdom
dir_of_interest_animals = [i for i in os.listdir(img_path) if "Animalia" in i]

# this list holds a [directory, filename] pair for every image
# (it holds animal images despite the name "flowers")
flowers = []
for d in dir_of_interest_animals:
    img_dir = os.path.join(img_path, d)
    # creates a ScandirIterator aliased as files
    with os.scandir(img_dir) as files:
        # loops through each file in the directory
        for file in files:
            if file.name.endswith('.jpg'):
                # adds only the image files to the flowers list
                flowers.append([img_dir, file.name])


#get only 500 images (randomly selected)
import random
flowers_r = random.sample(flowers, 500)
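
Because random.sample draws a different subset on every run, the cluster counts reported later in this post will vary between runs. Below is a minimal sketch of a reproducible draw, assuming a fixed seed is acceptable for the experiment. The seed value 22, the helper name sample_images, and the demo file names are illustrative choices and are not part of the original notebook.

```python
import random

def sample_images(items, k, seed=22):
    """Return k items drawn without replacement using a dedicated, seeded RNG."""
    rng = random.Random(seed)  # separate generator; the global random state is untouched
    return rng.sample(items, k)

# Hypothetical stand-in for the [directory, filename] pairs collected above.
demo_files = [["some_dir", "img_{}.jpg".format(n)] for n in range(2000)]

first_draw = sample_images(demo_files, 500)
second_draw = sample_images(demo_files, 500)  # same seed, so the same 500 images
```

With the real data, the call would be sample_images(flowers, 500).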

1- Inserting Noise

from PIL import Image, ImageFilter

def blur_image(image_name):
    """Blur one image, given as a [directory, filename] pair, and save the copy."""
    # open the original image
    im = Image.open(image_name[0] + "/" + image_name[1])

    # blur the image with a box filter of radius 10
    im1 = im.filter(ImageFilter.BoxBlur(10))
    f_path = "/content/drive/MyDrive/Colab Notebooks/ComputerVision/ImageClustering/moderate_blurry_animals/"

    # save the blurred copy with a "blur_" prefix
    im1.save(f_path + "blur_{}".format(image_name[1]))

# blur every sampled image; skip files that cannot be processed
for i, im in enumerate(flowers_r, start=1):
    try:
        print(i)
        blur_image(im)
    except Exception:
        continue

Figure 1: Images manually blurred with the Python Image filter function (PIL BoxBlur).

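To make the corruption step concrete, here is a simplified pure-Python sketch of what a box blur does: each pixel is replaced by the mean of the square window around it. This is not Pillow's implementation. The function name box_blur, the grayscale list-of-lists image, and the clamp-at-the-border edge handling are all assumptions made for illustration.

```python
def box_blur(pixels, radius):
    """Mean-filter a 2D grid of grayscale values over a (2*radius+1)^2 window."""
    height, width = len(pixels), len(pixels[0])
    blurred = []
    for y in range(height):
        row = []
        for x in range(width):
            total = 0.0
            count = 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    # Assumption: clamp coordinates at the border (repeat edge pixels).
                    yy = min(max(y + dy, 0), height - 1)
                    xx = min(max(x + dx, 0), width - 1)
                    total += pixels[yy][xx]
                    count += 1
            row.append(total / count)
        blurred.append(row)
    return blurred

# A single bright pixel in the middle of a dark 5x5 image.
image = [[0.0] * 5 for _ in range(5)]
image[2][2] = 90.0
smoothed = box_blur(image, radius=1)  # the bright spot is spread over a 3x3 patch
```

With a radius of 10, as used in this post, the window is 21 by 21 pixels, which is why fine detail such as a small animal largely disappears.
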
# add the blurred copies to the list of images to cluster
f_path = "/content/drive/MyDrive/Colab Notebooks/ComputerVision/ImageClustering/moderate_blurry_animals"
with os.scandir(f_path) as files:
    # loops through each file in the directory
    for file in files:
        if file.name.startswith('blur'):
            # adds only the blurred image files to the flowers_r list
            flowers_r.append([f_path, file.name])

2.1 — Generating Predictions

import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.utils import load_img

# load the model first and pass as an argument
model = VGG16()
# drop the final classification layer so the output is the 4096-dimensional feature vector
model = Model(inputs=model.inputs, outputs=model.layers[-2].output)

def extract_features(file, model):
    # load the image as a 224x224 array
    img = load_img(file, target_size=(224, 224))
    # convert from 'PIL.Image.Image' to numpy array
    img = np.array(img)
    # reshape the data for the model: (num_of_samples, dim 1, dim 2, channels)
    reshaped_img = img.reshape(1, 224, 224, 3)
    # prepare image for model
    imgx = preprocess_input(reshaped_img)
    # get the feature vector
    features = model.predict(imgx)
    return features

data = {}
# loop through each image in the dataset
for i, flower in enumerate(flowers_r, start=1):
    path = flower[0] + "/" + flower[1]
    # try to extract the features and update the dictionary
    try:
        print(flower)
        print(path)
        print(i)
        data[path] = extract_features(path, model)
    except Exception:
        # skip files that cannot be read or processed
        continue

1/1 [==============================] - 0s 109ms/step
['/content/drive/MyDrive/Colab Notebooks/ComputerVision/ImageClustering/moderate_blurry_animals', 'blur_2460b1db-c0c3-4b3e-9b25-17fa3e6dddec.jpg']
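
The try/except loop above skips any file that fails, so the number of feature vectors can end up smaller than the number of files without leaving a trace. Below is a minimal sketch of the same loop pattern that also records what was skipped. The names extract_all and fake_extract are hypothetical; fake_extract stands in for extract_features so that the example runs without TensorFlow, and the file names are made up.

```python
def extract_all(paths, extractor):
    """Run extractor on every path; return (features_by_path, failures)."""
    features = {}
    failures = []
    for path in paths:
        try:
            features[path] = extractor(path)
        except Exception as err:  # keep going, but remember what went wrong
            failures.append((path, repr(err)))
    return features, failures

def fake_extract(path):
    """Hypothetical stand-in for extract_features: rejects non-JPEG names."""
    if not path.endswith(".jpg"):
        raise ValueError("unsupported file: " + path)
    return [0.0] * 4  # placeholder feature vector

demo_paths = ["a.jpg", "b.txt", "c.jpg"]
features, failures = extract_all(demo_paths, fake_extract)
```

Checking len(failures) before clustering confirms how many of the original and blurred images actually made it into the data.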

2.2 — Clustering Images (Extracting Noisy Images)

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# get a list of the filenames
filenames = np.array(list(data.keys()))

# get a list of just the features
feat = np.array(list(data.values()))
feat.shape

# reshape to (number of images, 4096): one 4096-dimensional vector per image
feat = feat.reshape(-1, 4096)
feat.shape

# reduce the feature vectors to 100 principal components
pca = PCA(n_components=100, random_state=22)
pca.fit(feat)
x = pca.transform(feat)

# cluster the reduced vectors into 4 groups
kmeans = KMeans(n_clusters=4, random_state=22)
kmeans.fit(x)

# distance of every image to every cluster center, computed once
distances = kmeans.transform(x)

# holds the cluster id and the images { id: [[file, distances], ...] }
groups = {}
for idx, (file, cluster) in enumerate(zip(filenames, kmeans.labels_)):
    if cluster not in groups:
        groups[cluster] = []
    groups[cluster].append([file, distances[idx]])
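
The second element stored for each image is its row of kmeans.transform(x), that is, the Euclidean distance from that image's PCA vector to every cluster center. The assigned cluster is normally the one with the smallest distance. Below is a minimal pure-Python sketch of that relationship. The 2-D points, the two centers, and the file names are made up and stand in for the real 100-dimensional features.

```python
import math
from collections import defaultdict

def distances_to_centers(point, centers):
    """Euclidean distance from one point to each cluster center."""
    return [math.dist(point, c) for c in centers]

# Hypothetical 2-D data standing in for PCA-reduced feature vectors.
centers = [(0.0, 0.0), (10.0, 0.0)]
points = {
    "img_a.jpg": (1.0, 0.0),
    "img_b.jpg": (9.0, 0.0),
    "blur_img_c.jpg": (10.0, 3.0),
}

groups = defaultdict(list)  # cluster id -> [[filename, distances], ...]
for name, point in points.items():
    dists = distances_to_centers(point, centers)
    label = dists.index(min(dists))  # nearest center, as in the k-means assignment step
    groups[label].append([name, dists])
```

Each entry has the same [file, distances] shape as the groups dictionary in this post, so indexing the distance list with the cluster id gives the distance to that image's own center.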

# each entry holds the image path and
# the distance of the image to each cluster center (it is closest to its own cluster, 0)
groups[0][0]  # groups[0] is cluster 0; the second index selects the first image in that cluster
Notebooks/ComputerVision/ImageClustering/animals/2021_valid/03542_Animalia_Chordata_Aves_Columbiformes_Columbidae_Treron_calvus/fbbc6292-1e28-44aa-a070-4c60b3f9527f.jpg',
array([46.16834 , 57.954025, 61.86411 , 55.53268 ], dtype=float32)]

# count the blurred and the original ("good") images in each cluster
for cluster in groups:
    print(cluster, "blur", len([g for g in groups[cluster] if "blur" in g[0]]))
    print(cluster, "good", len([g for g in groups[cluster] if "blur" not in g[0]]))

0 blur 26
0 good 186
2 blur 14
2 good 115
1 blur 436
1 good 66
3 blur 24
3 good 133

# 4 clusters produced : 0,1,2,3
# Images marked as "good" were not manually blurred by the python blurring code.
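
Step 3 of the methodology picks the cluster with the highest ratio of blurred to original images. Below is a minimal sketch of that selection using the counts printed above. The helper name blur_ratio and the hard-coded counts dictionary are illustrative; in the notebook the counts would come from the groups dictionary.

```python
# Per-cluster counts copied from the output above: cluster id -> (blurred, original)
counts = {0: (26, 186), 1: (436, 66), 2: (14, 115), 3: (24, 133)}

def blur_ratio(blurred, original):
    """Ratio of manually blurred to original images in one cluster."""
    return blurred / original if original else float("inf")

ratios = {cluster: blur_ratio(*pair) for cluster, pair in counts.items()}
noisy_cluster = max(ratios, key=ratios.get)  # cluster dominated by blurred images

for cluster in sorted(ratios):
    print(cluster, round(ratios[cluster], 2))
print("cluster to inspect:", noisy_cluster)
```

The original images that land in the selected cluster are the candidates for noisy images, which is what the next section inspects.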

3- Inspecting Noisy Images

import pandas as pd

cluster_no = 1
i = 0
j = len(groups[cluster_no])  # groups[1] means cluster 1
i_list = []
j_list = []
g_list = []
g_p = []

for g in groups[cluster_no]:
    i = i + 1
    j = j - 1

    # keep only the original (not manually blurred) images
    if "blur" not in g[0]:  # g[0] is the path of the file
        i_list.append(i)
        j_list.append(j)
        g_list.append(g[1][cluster_no])  # g[1] is the distance array; index 1 is the distance to cluster 1
        g_p.append(g[0])
        print(i, j, g[1][cluster_no], g[0])

# dataframe of images in cluster 1 sorted by distance to the cluster center
pd.DataFrame(data={"i": i_list, "j": j_list, "g": g_list, "p": g_p}).sort_values("g")

      i    j          g  p
50   51  451  32.568340  /gdrive/My Drive/Colab Notebooks/ComputerVisio...
34   35  467  34.410278  /gdrive/My Drive/Colab Notebooks/ComputerVisio...
15   16  486  37.201794  /gdrive/My Drive/Colab Notebooks/ComputerVisio...
14   15  487  37.889763  /gdrive/My Drive/Colab Notebooks/ComputerVisio...
21   22  480  38.000500  /gdrive/My Drive/Colab Notebooks/ComputerVisio...
...  ...  ...        ...  ...
65   66  436  61.182003  /gdrive/My Drive/Colab Notebooks/ComputerVisio...
4     5  497  61.317337  /gdrive/My Drive/Colab Notebooks/ComputerVisio...
61   62  440  63.948025  /gdrive/My Drive/Colab Notebooks/ComputerVisio...
31   32  470  71.064636  /gdrive/My Drive/Colab Notebooks/ComputerVisio...
30   31  471  72.126785  /gdrive/My Drive/Colab Notebooks/ComputerVisio...

66 rows × 4 columns

from warnings import filterwarnings
import tensorflow as tf
from tensorflow import io
from tensorflow import image
from matplotlib import pyplot as plt
filterwarnings("ignore")

# read and display one of the images in the noisy cluster
tf_img = io.read_file(groups[cluster_no][51][0])
tf_img = image.decode_jpeg(tf_img, channels=3)  # the files are JPEGs
print(tf_img.dtype)
plt.imshow(tf_img)
groups[cluster_no][51]
['/gdrive/My Drive/Colab Notebooks/ComputerVision/ImageClustering/animals/2021_valid/00845_Animalia_Arthropoda_Insecta_Hymenoptera_Vespidae_Vespula_squamosa/be5a38f6-e883-4c25-b7e1-86409d9ed258.jpg',
array([60.284454, 68.53527 , 59.99124 , 71.699974], dtype=float32)]

Feature Maps of model predictions on blurry images have high attributions to the environment and much less to the creature.

Figure 2: Sample images from Cluster 1 (the cluster with 436 blurred images and 66 good ones) that were not manually blurred.

Emad Ezzeldin ,Sr. DataScientist@UnitedHealthGroup

5 years Data Scientist and a MSc from George Mason University in Data Analytics. I enjoy experimenting with Data Science tools. emad.ezzeldin4@gmail.com