Learning an input image that maximizes the goldfish category

I created this code, but something is wrong. The input image should learn the best goldfish representation, because I use the goldfish ImageNet category (index 1, i.e. the second class). Starting from a random image, the code should update the input image for a number of iterations.
However, the end result is that the loss goes down to zero, but the image still looks like random noise. Can you please help me find where I am wrong? Here is the code:

import torch
import torchvision.models as models
import torch.optim as optim
from PIL import Image
import numpy as np
import torch.nn.functional as F  # Import functional interface for torch.nn
from torchvision.models.vgg import VGG16_Weights

# Load a pre-trained VGG16 model
vgg16 = models.vgg16(weights=VGG16_Weights.IMAGENET1K_V1)
vgg16.eval()  # Set the model to evaluation mode

# Index of the target class (goldfish in this case)
target_class = 1  # Index 1 corresponds to 'goldfish' in ImageNet classes

# Define the optimizer (to update the input image)
input_image = torch.randn(1, 3, 224, 224).requires_grad_(True)  # Initialize input image with random noise
optimizer = optim.SGD([input_image], lr=0.05)  # Use the input image as parameter to optimize

# Number of optimization steps
num_iterations = 1000

# Optimization loop
for i in range(num_iterations):
    print(i)
    optimizer.zero_grad()  # Zero gradients
    
    # Forward pass through VGG16 with the current input image
    output = vgg16(input_image)
    
    # Apply Softmax function to the output tensor
    probabilities = F.softmax(output, dim=1)
    
    # Compute the loss - maximize the probability of the target class
    loss = -probabilities[0, target_class].log()  # Negative log probability of the target class
    print(loss)
    
    # Backpropagation: Compute gradients of the input image wrt the loss
    loss.backward()
    
    # Update the input image using the gradients (gradient ascent)
    optimizer.step()
    
    # Clamp the image pixel values to stay within valid range (0-1)
    #input_image.data.clamp_(0, 1)

    if i % 10 == 0:
        # Convert the optimized input image tensor to a PIL image
        generated_image = input_image.squeeze(0).detach().cpu().numpy()
        generated_image = np.moveaxis(generated_image, 0, -1)  # Change from [C, H, W] to [H, W, C]
        generated_image = (generated_image * 255).astype(np.uint8)  # Rescale to [0, 255] for PIL

        # Save the generated image
        image_output = Image.fromarray(generated_image)
        image_output.save("generated_goldfish_image" + str(i) + ".jpg")


Hi Blackbird!

The short story: To create a synthetic image that looks like a goldfish,
you should train a generative adversarial network.

I haven’t looked at your code, so maybe there’s a bug in it somewhere.

However, there’s an important conceptual issue here.

Your pretrained vgg16 has been trained so that if you present it with a
picture of a goldfish, it should classify it as goldfish rather than, say, a
magpie or a scorpion. You would also expect that if you present it with an
image of a coronavirus, it will not predict any class with a high probability.
(Some classes will have higher probabilities and some lower, but none
should be predicted to be very likely or certain.) Likewise, if you present
the model with a random-noise image, you should get no clear prediction.
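
You can check this last claim with a quick sketch like the following, using the same pretrained vgg16 as in your code:

import torch
import torch.nn.functional as F
import torchvision.models as models
from torchvision.models.vgg import VGG16_Weights

vgg16 = models.vgg16(weights=VGG16_Weights.IMAGENET1K_V1)
vgg16.eval()

with torch.no_grad():
    noise = torch.randn(1, 3, 224, 224)            # a random-noise "image"
    probs = F.softmax(vgg16(noise), dim=1)
    top_probs, top_classes = probs.topk(5, dim=1)  # five most likely classes
    print(top_classes)
    print(top_probs)  # expect small, diffuse probabilities, no clear winner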

However, the vgg16 has not been trained to not be “fooled” by a non-random
image (that happens to look like random noise to the human eye) that you
have carefully constructed to “look like” a goldfish (in some weird way) to your
specific pretrained vgg16.

Figuratively speaking, the image you have constructed has some pixels that
look like a goldfish texture, some that look like goldfish coloring, some that
look like the ghost of a goldfish eye, and so on, but even though all of these
goldfish cues don’t fit together into anything resembling a goldfish, your
model sees enough of these goldfish cues (and no magpie cues, etc.) and
predicts “goldfish” with high probability.

And furthermore, you’ve constructed your image to contain those particular
goldfish cues that this particular vgg16 cares about.

Consider the following experiment: Train two vgg16s from scratch separately,
starting with two different random initializations. Train on the same dataset
(but randomly shuffle your batches differently, if you so choose). Use the
first vgg16 to construct your trick image. It will look like random noise, but
the first vgg16 will classify it as a goldfish. However, the second vgg16
will not predict any specific class with high probability – that is, it will look
like random noise to the second vgg16.
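
In code, that experiment might look roughly like this (just a sketch; vgg16_a and vgg16_b stand for your two from-scratch checkpoints, which torchvision does not provide):

import torch
import torch.nn.functional as F
import torch.optim as optim
import torchvision.models as models

# Assumed: two VGG16s trained from scratch on the same dataset with
# different random initializations (you would load your own checkpoints here)
vgg16_a = models.vgg16(weights=None)
vgg16_b = models.vgg16(weights=None)
vgg16_a.eval()
vgg16_b.eval()

target_class = 1  # goldfish
trick_image = torch.randn(1, 3, 224, 224, requires_grad=True)
optimizer = optim.SGD([trick_image], lr=0.05)

# Construct the trick image against vgg16_a only
for _ in range(1000):
    optimizer.zero_grad()
    loss = -F.log_softmax(vgg16_a(trick_image), dim=1)[0, target_class]
    loss.backward()
    optimizer.step()

with torch.no_grad():
    p_a = F.softmax(vgg16_a(trick_image), dim=1)[0, target_class]
    p_b = F.softmax(vgg16_b(trick_image), dim=1)[0, target_class]
    print(f"vgg16_a goldfish probability: {p_a.item():.4f}")  # expect near 1
    print(f"vgg16_b goldfish probability: {p_b.item():.4f}")  # expect near chance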

However, when you train a generative adversarial network, you train a
generator network to produce images that look real (both to
the human eye and to another network) while simultaneously training a
discriminator network to correctly classify real images and not be
“fooled” by your synthetic images.

Initially, your generator will learn to construct the specific goldfish-like
cues (that don’t really look like a goldfish) that the current version of the
discriminator recognizes as “goldfish.” But as you train the discriminator,
future versions of it will have learned that those particular cues are synthetic,
so they no longer recognize those cues as “goldfish.” This process
goes back and forth until your generator learns to construct images that
really do look like goldfish and whose cues can no longer be recognized
as being synthetically generated.
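
Schematically, one round of that back-and-forth looks like the following (a minimal sketch with toy fully-connected networks and random tensors standing in for a real goldfish dataset; a real GAN would use convolutional networks and real images):

import torch
import torch.nn as nn
import torch.optim as optim

latent_dim, image_dim = 64, 3 * 32 * 32  # toy sizes

# Toy fully-connected generator and discriminator
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                  nn.Linear(256, image_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(image_dim, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1))

opt_g = optim.Adam(G.parameters(), lr=2e-4)
opt_d = optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.rand(16, image_dim) * 2 - 1  # stand-in for real goldfish images
    fake = G(torch.randn(16, latent_dim))     # generator's synthetic images

    # Discriminator step: label real images 1, synthetic images 0
    opt_d.zero_grad()
    d_loss = (bce(D(real), torch.ones(16, 1))
              + bce(D(fake.detach()), torch.zeros(16, 1)))
    d_loss.backward()
    opt_d.step()

    # Generator step: try to make the discriminator call the fakes real
    opt_g.zero_grad()
    g_loss = bce(D(fake), torch.ones(16, 1))
    g_loss.backward()
    opt_g.step()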

Best.

K. Frank

Hi KFrank,

Thank you very much for the feedback.
My intention was not to create an adversarial network; at this time I just want to do feature visualization.

The code I shared runs in a Colab notebook and takes a few minutes to execute. I know the input image is actually being updated, because I check the MD5 checksums of the files being created.

I even tried starting from a goldfish image and maximizing the target category by updating the input image. I expected the input image to show even more goldfish artifacts at the end, for instance several fish eyes, fish scales, etc.

However, even though at each step the image is different from an MD5 perspective, it looks the same as the original image.

I took my inspiration from this great article

In particular, I am using this image and the latest example, where I set the goldfish category after the softmax.

It should work, but I don’t know why it is not working in my case; maybe I have some bug, or maybe proper initialization is missing.
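
For reference, here is the kind of change I suspect might be needed (just a sketch, not verified). It normalizes the input with the ImageNet statistics that torchvision's pretrained models expect, uses log_softmax for numerical stability, and clamps the pixels to [0, 1] so that the saved files are valid images:

import numpy as np
import torch
import torch.nn.functional as F
import torch.optim as optim
import torchvision.models as models
from torchvision.models.vgg import VGG16_Weights
from PIL import Image

vgg16 = models.vgg16(weights=VGG16_Weights.IMAGENET1K_V1)
vgg16.eval()

# ImageNet statistics that torchvision's pretrained models expect
mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

target_class = 1  # goldfish
# Optimize the image in [0, 1] pixel space; normalize before the forward pass
input_image = torch.rand(1, 3, 224, 224, requires_grad=True)
optimizer = optim.SGD([input_image], lr=0.05)

for i in range(1000):
    optimizer.zero_grad()
    output = vgg16((input_image - mean) / std)
    # log_softmax is numerically more stable than log(softmax(...))
    loss = -F.log_softmax(output, dim=1)[0, target_class]
    loss.backward()
    optimizer.step()

    with torch.no_grad():
        input_image.clamp_(0, 1)  # keep pixels in the valid image range

    if i % 10 == 0:
        img = input_image.squeeze(0).detach().cpu().numpy()
        img = np.moveaxis(img, 0, -1)  # [C, H, W] -> [H, W, C]
        Image.fromarray((img * 255).astype(np.uint8)).save(
            "generated_goldfish_image" + str(i) + ".jpg")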