I would like to understand better how autograd works. I decided to run a simple experiment: generating a noise image (from a human point of view) that the network classifies with high confidence as a target class, similar to how FGSM attacks work. I follow these steps:
loading a trained CNN model (loading code omitted) and defining the target class:
target = torch.tensor([5])  # class index 5
generating a random/noise image:
image = torch.rand((1, 3, 32, 32))
image.requires_grad = True
passing it through the model and computing the loss and its gradient w.r.t. the image:
output = net(image)
criterion = torch.nn.CrossEntropyLoss()
cost = criterion(output, target)
# torch.autograd.grad returns a tuple (one entry per input), so unpack it
grad, = torch.autograd.grad(cost, image, retain_graph=False, create_graph=False)
modifying the original noise image:
ideal_image = image - grad.sign() #.clamp(min=0.0, max=1.0)
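Putting the steps together, here is a minimal runnable sketch of what I do. Note that the small untrained CNN is just a stand-in for my trained network (an assumption for the sake of a self-contained example):

```python
import torch
import torch.nn as nn

# Stand-in model: a tiny untrained CNN (my real code uses a trained network)
net = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),
)

target = torch.tensor([5])           # target class index
image = torch.rand((1, 3, 32, 32))   # random starting image
image.requires_grad = True

output = net(image)                  # logits, shape (1, 10)
criterion = nn.CrossEntropyLoss()
cost = criterion(output, target)

# torch.autograd.grad returns a tuple, one entry per input tensor
grad, = torch.autograd.grad(cost, image, retain_graph=False, create_graph=False)

# FGSM-style step with eps = 1, clamped back to the valid pixel range
ideal_image = (image - grad.sign()).clamp(min=0.0, max=1.0)
print(ideal_image.shape)  # torch.Size([1, 3, 32, 32])
```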
The code above is slightly simplified.
I expected the new image to be classified by the network with ~100% confidence as class 5 (my target). In the FGSM attack an eps parameter scales the step, but here I wanted to create the perfect input for my network, so I set eps=1. However, that is not what happens: the network returns a nearly random top class.
What did I do wrong? I thought that if I subtracted the gradient of the loss from the noise image, then from the network's point of view the new image would be an ideal example of the target class. Maybe it works differently?
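For comparison, I also tried many small gradient steps instead of one huge sign() step (my assumption being that the single-step linearization may simply be too coarse; the untrained stand-in model here is again a placeholder for my trained network):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in model: tiny untrained CNN (placeholder for the trained network)
net = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10),
)
criterion = nn.CrossEntropyLoss()
target = torch.tensor([5])

image = torch.rand((1, 3, 32, 32), requires_grad=True)

losses = []
for _ in range(100):
    cost = criterion(net(image), target)
    losses.append(cost.item())
    grad, = torch.autograd.grad(cost, image)
    # small step along the raw gradient instead of one eps=1 sign() step
    with torch.no_grad():
        image -= 0.1 * grad
        image.clamp_(0.0, 1.0)  # keep pixels in the valid range

print(losses[0], losses[-1])  # the loss drops over the iterations
```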