Generative Deep Learning Updated Edition, Chapter 55

4.5 Evaluating the Model


Evaluating a Generative Adversarial Network (GAN) is crucial for understanding how well the model generates realistic images and for identifying areas for improvement. This section covers both qualitative and quantitative methods for evaluating the GAN trained on face generation. We discuss the Inception Score (IS) and the Fréchet Inception Distance (FID) and provide example code for calculating both metrics.

4.5.1 Qualitative Evaluation

Qualitative evaluation involves visually inspecting the generated images to assess their realism and diversity. This method is subjective but essential for gaining an initial understanding of the model's performance. Here are some aspects to consider during qualitative evaluation:

Realism: Do the generated images look like real faces?

Diversity: Are the generated images diverse, covering a wide range of facial features and expressions?

Artifacts: Are there any noticeable artifacts or inconsistencies in the generated images?

Example: Visualizing Generated Images

You can visualize the generated images using matplotlib to perform a qualitative evaluation:

import matplotlib.pyplot as plt
import numpy as np

def plot_generated_images(generator, latent_dim, n_samples=10):
    # Sample latent vectors from a standard normal distribution
    noise = np.random.normal(0, 1, (n_samples, latent_dim))
    generated_images = generator.predict(noise)
    # Rescale to [0, 255] (assumes the generator outputs values in [-1, 1])
    generated_images = (generated_images * 127.5 + 127.5).astype(np.uint8)

    plt.figure(figsize=(20, 2))
    for i in range(n_samples):
        plt.subplot(1, n_samples, i + 1)
        plt.imshow(generated_images[i])
        plt.axis('off')
    plt.show()

# Generate and plot new faces for qualitative evaluation
latent_dim = 100
plot_generated_images(generator, latent_dim, n_samples=10)

The function plot_generated_images generates a specified number of images (10 by default) using the generator. It samples random noise from a standard normal distribution, feeds it to the generator, and rescales the output images to pixel values in the range [0, 255]; the rescaling assumes the generator outputs values in [-1, 1], as it does with a tanh output layer. The images are then displayed side by side in a single row.

The last two lines of code call this function using a generator model and a latent dimension of 100, generating and displaying 10 images.

4.5.2 Quantitative Evaluation

Quantitative evaluation provides objective measures of the quality and diversity of the generated images. Two widely used metrics for evaluating GANs are the Inception Score (IS) and the Fréchet Inception Distance (FID).

Inception Score (IS)

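The Inception Score measures quality and diversity using the class probabilities that a pre-trained Inception network predicts for the generated images. An image counts as high quality when its predicted class distribution p(y|x) is sharply peaked, and the set counts as diverse when the average of those distributions, p(y), is spread over many classes. Higher scores indicate better quality and diversity. Because the network is trained on ImageNet classes, the score is a weaker signal for a face-only dataset than for multi-class data, so it is best read alongside FID.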

Formula:

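$$\mathrm{IS} = \exp\left( \mathbb{E}_{x \sim p_g} \left[ D_{\mathrm{KL}}\big( p(y \mid x) \,\|\, p(y) \big) \right] \right)$$

where $p(y \mid x)$ is the class distribution the Inception network predicts for a generated image $x$, $p(y) = \mathbb{E}_{x \sim p_g}[p(y \mid x)]$ is the marginal class distribution over all generated images, and $D_{\mathrm{KL}}$ is the Kullback-Leibler divergence.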
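The Inception Score is the exponential of the average KL divergence between each image's predicted class distribution p(y|x) and the marginal distribution p(y). As a minimal sketch of that definition, the snippet below computes the score directly from a matrix of class probabilities, without running an Inception network. The helper name `inception_score_from_probs` and the two toy probability matrices are illustrative assumptions, not part of the chapter's code.

```python
import numpy as np

def inception_score_from_probs(probs, eps=1e-16):
    """Compute exp(mean_x KL(p(y|x) || p(y))) from an (n_images, n_classes) array."""
    probs = np.asarray(probs, dtype=np.float64)
    p_y = probs.mean(axis=0, keepdims=True)  # marginal class distribution p(y)
    # Per-image KL divergence; eps avoids log(0) for zero probabilities
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Toy case 1 (hypothetical): each image is confidently assigned to a different class
sharp_and_diverse = np.eye(4)
# Toy case 2 (hypothetical): every image receives the same uniform prediction
uninformative = np.full((4, 4), 0.25)

print(inception_score_from_probs(sharp_and_diverse))  # close to 4, the number of classes
print(inception_score_from_probs(uninformative))      # close to 1, the minimum
```

The two cases bracket the possible range: the score is bounded below by 1 and above by the number of classes, which is why confident and varied predictions push it up.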

Example: Calculating Inception Score

import numpy as np
import tensorflow as tf
from scipy.stats import entropy
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input

def calculate_inception_score(images, n_split=10, eps=1E-16):
    # Load InceptionV3 with its ImageNet classification head so that
    # predict() returns class probabilities rather than pooled features
    model = InceptionV3(include_top=True, weights='imagenet')
    # Assumption: images are in [-1, 1] (tanh output); preprocess_input expects [0, 255]
    images = (np.asarray(images, dtype=np.float32) + 1.0) * 127.5
    images_resized = tf.image.resize(images, (299, 299))
    images_preprocessed = preprocess_input(images_resized)
    # Predict the class probability distribution p(y|x) for each image
    preds = model.predict(images_preprocessed)
    # Calculate exp(mean KL(p(y|x) || p(y))) on each split
    split_scores = []
    for i in range(n_split):
        part = preds[i * preds.shape[0] // n_split: (i + 1) * preds.shape[0] // n_split]
        py = np.mean(part, axis=0)
        scores = []
        for p in part:
            # eps guards against zero probabilities in the KL divergence
            scores.append(entropy(p + eps, py + eps))
        split_scores.append(np.exp(np.mean(scores)))
    return np.mean(split_scores), np.std(split_scores)

# Generate images
n_samples = 1000
noise = np.random.normal(0, 1, (n_samples, latent_dim))
generated_images = generator.predict(noise)

# Calculate Inception Score
is_mean, is_std = calculate_inception_score(generated_images)
print(f"Inception Score: {is_mean} ± {is_std}")

The code imports the necessary modules and defines the function calculate_inception_score. This function uses the InceptionV3 model from TensorFlow to predict a probability distribution over classes for each image. For each of the n_split subsets, it computes the Kullback-Leibler (KL) divergence between each image's predicted distribution and the subset's mean distribution, then exponentiates the average divergence to obtain the Inception Score for that subset.

A high Inception Score indicates that the model generates diverse and realistic images. The function returns the mean and standard deviation of the Inception Scores for a given set of images.

The last part of the code generates images from random noise using a 'generator' model, and then calculates and prints the Inception Score for these images.

Fréchet Inception Distance (FID)

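The Fréchet Inception Distance measures the distance between the distributions of real and generated images in the feature space of a pre-trained Inception network, treating each set of features as a multivariate Gaussian. Lower FID scores indicate that the generated images are closer to the real ones in both quality and diversity.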

Formula:

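$$\mathrm{FID} = \lVert \mu_r - \mu_g \rVert^2 + \mathrm{Tr}\left( \Sigma_r + \Sigma_g - 2 (\Sigma_r \Sigma_g)^{1/2} \right)$$

where $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ are the means and covariances of the real and generated image distributions, respectively.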

Example: Calculating FID

import numpy as np
import tensorflow as tf
from numpy import cov, trace, iscomplexobj
from scipy.linalg import sqrtm
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input

def calculate_fid(real_images, generated_images):
    # Load InceptionV3 without the classification head; average pooling yields feature vectors
    model = InceptionV3(include_top=False, pooling='avg', input_shape=(299, 299, 3))
    # Assumption: both image sets are in [-1, 1]; preprocess_input expects [0, 255]
    real_images = (np.asarray(real_images, dtype=np.float32) + 1.0) * 127.5
    generated_images = (np.asarray(generated_images, dtype=np.float32) + 1.0) * 127.5
    # Resize and preprocess images
    real_images_resized = tf.image.resize(real_images, (299, 299))
    generated_images_resized = tf.image.resize(generated_images, (299, 299))
    real_images_preprocessed = preprocess_input(real_images_resized)
    generated_images_preprocessed = preprocess_input(generated_images_resized)
    # Calculate activations
    act1 = model.predict(real_images_preprocessed)
    act2 = model.predict(generated_images_preprocessed)
    # Calculate mean and covariance
    mu1, sigma1 = act1.mean(axis=0), cov(act1, rowvar=False)
    mu2, sigma2 = act2.mean(axis=0), cov(act2, rowvar=False)
    # Calculate FID
    ssdiff = np.sum((mu1 - mu2)**2.0)
    covmean = sqrtm(sigma1.dot(sigma2))
    # sqrtm can return small imaginary parts from numerical error; keep the real part
    if iscomplexobj(covmean):
        covmean = covmean.real
    fid = ssdiff + trace(sigma1 + sigma2 - 2.0*covmean)
    return fid

# Generate images
n_samples = 1000
noise = np.random.normal(0, 1, (n_samples, latent_dim))
generated_images = generator.predict(noise)

# Sample real images
real_images = x_train[np.random.choice(x_train.shape[0], n_samples, replace=False)]

# Calculate FID
fid_score = calculate_fid(real_images, generated_images)
print(f"FID Score: {fid_score}")

The script defines a function calculate_fid(real_images, generated_images) that computes the FID score. It uses the InceptionV3 model from Keras to calculate activations for the real and generated images. These activations are then used to compute the mean and covariance of each image set.

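The FID score is the squared Euclidean distance between the two mean vectors plus the trace of the sum of the covariance matrices minus twice the matrix square root of their product. Because the matrix square root can pick up small imaginary components from numerical error, the code keeps only its real part.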
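To see how the mean term and the covariance term of the Fréchet distance behave, the following sketch applies the same formula to small synthetic feature arrays instead of Inception activations. The function name `fid_from_features` and the random toy data are assumptions made for illustration. The sketch also swaps in a different way of getting the trace of the matrix square root: it sums the square roots of the eigenvalues of the covariance product rather than calling scipy.linalg.sqrtm, which gives the same trace for covariance matrices and keeps the example NumPy-only.

```python
import numpy as np

def fid_from_features(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two (n_samples, n_features) arrays."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    # Tr((sigma_r sigma_g)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of sigma_r @ sigma_g, which are real and non-negative
    # (up to numerical noise) when both matrices are covariance matrices.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    tr_covmean = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    mean_term = np.sum((mu_r - mu_g) ** 2)
    return float(mean_term + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * tr_covmean)

rng = np.random.default_rng(0)
real = rng.normal(size=(2000, 8))  # stand-in for real-image features
shifted = real + 0.5               # same covariance, every feature mean moved by 0.5

print(fid_from_features(real, real))     # approximately 0
print(fid_from_features(real, shifted))  # approximately 8 * 0.5**2 = 2.0
```

Identical feature sets give a distance of zero, and a pure shift of the mean contributes only through the first term, which matches the structure of the formula.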

The function is then used with a set of real images and a set of generated images to compute a FID score. The generated images are created by a generator network from random noise, and the real images are sampled from a training set x_train. Finally, the FID score is printed.

4.5.3 Comparing with Baseline Models

To understand the performance of your GAN model, it’s useful to compare the results with baseline models. This could involve:

Comparing with a GAN trained with a different architecture.

Comparing with a GAN trained with different hyperparameters.

Comparing with other generative models like VAEs (Variational Autoencoders).
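A comparison is only meaningful when every model is scored with the same metric against the same reference set. The sketch below shows one way to organize that; `rank_models`, the placeholder metric `mean_gap`, and the synthetic "samples" are all hypothetical stand-ins, and in practice the metric would be something like the FID function defined earlier in this section.

```python
import numpy as np

def rank_models(metric_fn, reference, candidates, lower_is_better=True):
    """Score each candidate's samples against one shared reference set and sort."""
    scores = {name: float(metric_fn(reference, samples))
              for name, samples in candidates.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=not lower_is_better)

def mean_gap(a, b):
    # Placeholder metric (assumption): squared distance between sample means
    return np.sum((a.mean(axis=0) - b.mean(axis=0)) ** 2)

rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 4))  # synthetic stand-in for real data
candidates = {
    "baseline_gan": reference + 1.0,   # hypothetical samples far from the reference
    "tuned_gan": reference + 0.1,      # hypothetical samples close to the reference
}

ranking = rank_models(mean_gap, reference, candidates)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Keeping the reference set and the number of samples fixed across models matters because metrics such as FID are sensitive to sample size.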

4.5.4 Addressing Common Issues

During evaluation, you might encounter common issues such as:

Mode Collapse: The generator produces limited diversity in the output images. This can be addressed by techniques such as minibatch discrimination, unrolled GANs, or using different loss functions.

Training Instability: The generator and discriminator losses oscillate significantly. This can be mitigated by using techniques like Wasserstein GANs (WGANs) or spectral normalization.
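Before reaching for those remedies, it can help to have a rough numerical signal that mode collapse is happening. The sketch below uses the mean pairwise pixel distance within a batch of generated images as a crude diversity check. This is a simple heuristic chosen for illustration, not a standard metric and not one of the techniques named above; the function name and the synthetic batches are assumptions.

```python
import numpy as np

def mean_pairwise_distance(images):
    """Average Euclidean distance between all distinct pairs of flattened images."""
    flat = np.asarray(images, dtype=np.float64).reshape(len(images), -1)
    sq_norms = np.sum(flat ** 2, axis=1)
    # Squared distances from the identity |a - b|^2 = |a|^2 + |b|^2 - 2 a.b
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * flat @ flat.T
    dists = np.sqrt(np.clip(sq_dists, 0.0, None))
    upper = np.triu_indices(len(flat), k=1)  # each pair counted once
    return float(dists[upper].mean())

rng = np.random.default_rng(0)
# Synthetic batch with varied content (stand-in for a healthy generator)
varied = rng.uniform(-1, 1, size=(16, 8, 8, 3))
# Synthetic batch of near-copies of one image (stand-in for a collapsed generator)
collapsed = np.repeat(varied[:1], 16, axis=0) + rng.normal(scale=0.01, size=(16, 8, 8, 3))

print(mean_pairwise_distance(varied) > mean_pairwise_distance(collapsed))  # True
```

A value that drops sharply over training, compared with the same statistic on real images, suggests the generator is producing near-duplicates and is worth confirming by visual inspection.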

Summary

Evaluating a GAN involves both qualitative and quantitative methods to ensure that the generated images are realistic and diverse. Qualitative evaluation through visual inspection helps in identifying immediate issues, while quantitative metrics like Inception Score and Fréchet Inception Distance provide objective measures of performance. By systematically evaluating and comparing the model's outputs, you can identify areas for improvement and refine your GAN to produce high-quality images.