Gradient with respect to input with multiple outputs


(Sami) #1

Hey guys!
I’ve posted a similar topic before and have read every thread I could find on the subject, but I just can’t seem to get it.

I’m trying to implement relevance propagation for convolutional layers.
For this, I need to calculate the gradient of a given layer with respect to its input. Since I have to calculate this gradient for intermediate layers, I do not have a scalar value at my output, but a multidimensional tensor. What I want to achieve is

[formula images not preserved]

The code for that would look like this:

    def relprop(self, R):
        pself = copy.copy(self)
        pself.bias.data *= 0
        pself.weight.data = torch.max(torch.DoubleTensor(1).zero_(), pself.weight)
        Z = pself.forward(self.X) + 1e-9
        S = torch.div(R, Z)

        C = S.backward()
        print(self.X.grad)
        R = self.X * C
        return R

The backward call should basically compute S * pself.weight, since I am trying to get the gradient with respect to X, but do so efficiently.
My problem now is that I cannot calculate the gradient of S, since backward() without arguments is only supported for scalar values. I know I can pass a gradient tensor to backward(), but I can’t for the life of me figure out what to plug in.
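
For reference, the stated goal (multiplying S by the positive weights to get a gradient with respect to X) can be read as a vector-Jacobian product of Z with S as the vector. The following is only a minimal sketch under assumptions that are not from the thread: a small `torch.nn.Linear` layer stands in for the convolution, and the names `layer`, `X`, `R` and `R_in` are purely illustrative.

```python
import torch

torch.manual_seed(0)

# Assumption: a tiny linear layer stands in for the convolutional layer.
layer = torch.nn.Linear(4, 3, bias=False).double()
layer.weight.data.clamp_(min=0)  # keep only the positive part of the weights

X = torch.rand(2, 4, dtype=torch.double, requires_grad=True)
R = torch.rand(2, 3, dtype=torch.double)  # relevance coming from the layer above

Z = layer(X) + 1e-9
S = (R / Z).detach()  # treat S as a constant vector, not as part of the graph

# Vector-Jacobian product: for a linear layer this equals S @ W.
(C,) = torch.autograd.grad(Z, X, grad_outputs=S)

R_in = X.detach() * C
```

Because `S` is detached, autograd only differentiates `Z`, which is why the result matches `S @ layer.weight` in this linear case.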


(Justus Schock) #2
C = S.backward(torch.ones_like(S))

This should work for you. It’s what I use when I have to deal with a per-pixel loss in segmentation.
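
To illustrate what the tensor passed to `backward()` does, here is a small sketch with made-up values (the names `x`, `grad_via_ones` and `grad_via_sum` are illustrative only). Passing `torch.ones_like(S)` gives the same gradient as reducing `S` with `sum()` first and calling `backward()` on the resulting scalar.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Non-scalar output: backward() needs a tensor of the same shape as S.
S = x ** 2
S.backward(torch.ones_like(S))
grad_via_ones = x.grad.clone()

# Same computation, but reduced to a scalar before calling backward().
x.grad = None
(x ** 2).sum().backward()
grad_via_sum = x.grad.clone()
```

Whether this is the quantity a given method needs depends on which tensor is being differentiated; the sketch only shows the effect of the `gradient` argument.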


(Sami) #3

This gives me None, unfortunately.


(Justus Schock) #4

Oh, I’ve overlooked it somehow.

C = S.backward(torch.ones_like(S))

is meant to return None, since it modifies the parameters’ gradients in place.

You might want to use torch.autograd.grad instead.
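
As a rough sketch of the difference (variable names here are illustrative, not from the thread): `torch.autograd.grad(outputs, inputs, grad_outputs=...)` returns the gradients as a tuple with one entry per input and leaves the `.grad` attributes untouched.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
S = x ** 2  # non-scalar output

# outputs first, inputs second, then the tensor that weights the outputs.
grads = torch.autograd.grad(S, x, grad_outputs=torch.ones_like(S))

C = grads[0]                        # gradient with respect to x
grad_attr_untouched = x.grad is None  # .grad is not populated by autograd.grad
```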


(Sami) #5

What do I pass for the output parameters? Any tensor the size of S?

        torch.autograd.grad(C, S, torch.ones_like(S))

where I just create a tensor like ‘torch.ones_like(S)’?
I’ve also tried the following:

        x = copy.copy(self.X)
        pself.bias.data *= 0
        pself.weight.data = torch.max(torch.DoubleTensor(1).zero_(), pself.weight)
        Z = pself.forward(self.X) + 1e-9
        S = torch.div(R, Z)

        C = S.backward(torch.ones_like(S))

While printing the .grad of self.X yielded None, the newly copied x variable does have a gradient.


(Justus Schock) #6

This is how I would understand the docs, since the previous gradients should be multiplied with the ones calculated from your operations.

This is weird, since the gradients should be calculated only for the inputs of the operations in the dynamically created graph. Are you sure this is not because of some previous attempts that might still be cached or something like that?


(Sami) #7

Sorry, I have copied the wrong code, apparently. I have, of course, run the forward pass with the newly copied variable x.
I’m still wondering why the gradient was not saved into the self.X variable?
Thanks for your help, by the way!


(Justus Schock) #8

If you had only done x = self.X, the tensors would still share the same storage, and therefore gradients calculated for x would also apply to self.X. Since you do x = copy.copy(self.X), new storage for x (and its gradients) will be allocated. This decouples the tensors from each other: they only have the same underlying data at the beginning, but not the same storage any more!


(Sami) #9

Yes, this I understand. What I’m puzzled about is that if I do

        pself = copy.copy(self)
        pself.bias.data *= 0
        pself.weight.data = torch.max(torch.DoubleTensor(1).zero_(), pself.weight)
        Z = pself.forward(self.X) + 1e-9
        S = torch.div(R, Z)
        C = S.backward(torch.ones_like(S))
        print(self.X.grad)

yields None, while

        pself = copy.copy(self)
        x = copy.copy(self.X)
        pself.bias.data *= 0
        pself.weight.data = torch.max(torch.DoubleTensor(1).zero_(), pself.weight)
        Z = pself.forward(x) + 1e-9
        S = torch.div(R, Z)
        C = S.backward(torch.ones_like(S))
        print(x.grad)

does yield the same gradient as

        C = torch.autograd.grad(S, self.X, torch.ones_like(S))

gives me


(Justus Schock) #10

What’s the problem with that? I think that’s exactly how it should work.


(Sami) #11

The thing I don’t understand is that using forward with

        Z = pself.forward(self.X) + 1e-9
        [...]
        C = S.backward(torch.ones_like(S))
        print(self.X.grad)

yields None, while

        Z = pself.forward(x) + 1e-9
        [...]
        C = S.backward(torch.ones_like(S))
        print(x.grad)

gives me a gradient


(Justus Schock) #12

Now, I’m kinda confused too. Can you please run each of the following snippets and post their results?

First:

        pself = copy.copy(self)
        x = copy.copy(self.X)
        pself.bias.data *= 0
        pself.weight.data = torch.max(torch.DoubleTensor(1).zero_(), pself.weight)
        Z = pself.forward(x) + 1e-9
        S = torch.div(R, Z)
        C = S.backward(torch.ones_like(S))
        print(x.grad)

Second:

        self.bias.data *= 0
        self.weight.data = torch.max(torch.DoubleTensor(1).zero_(), self.weight)
        Z = self.forward(self.X) + 1e-9
        S = torch.div(R, Z)
        C = S.backward(torch.ones_like(S))
        print(self.X.grad)

Third:

        pself = copy.copy(self)
        x = copy.copy(self.X)
        pself.bias.data *= 0
        pself.weight.data = torch.max(torch.DoubleTensor(1).zero_(), pself.weight)
        Z = pself.forward(x) + 1e-9
        S = torch.div(R, Z)
        C = torch.autograd.grad(S, x, torch.ones_like(S))

And fourth:

        self.bias.data *= 0
        self.weight.data = torch.max(torch.DoubleTensor(1).zero_(), self.weight)
        Z = self.forward(self.X) + 1e-9
        S = torch.div(R, Z)
        C = torch.autograd.grad(S, self.X, torch.ones_like(S))

Also note that you should not call module.forward(x) directly but module(x) instead (if module is an instance of torch.nn.Module), since this calls the forward function internally but also does some other work that hooks etc. need in order to function properly.
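
A small sketch of that difference, using an illustrative `torch.nn.Linear` layer and a hypothetical `calls` list to record hook invocations: a forward hook runs when the module is called, but not when `forward` is invoked directly.

```python
import torch

layer = torch.nn.Linear(2, 2)
calls = []

# The hook just records that it ran; it returns None, so the output is unchanged.
layer.register_forward_hook(lambda module, inputs, output: calls.append("hook"))

x = torch.randn(1, 2)

layer.forward(x)                  # bypasses __call__, so the hook is skipped
calls_after_forward = len(calls)

layer(x)                          # goes through __call__, so the hook runs
calls_after_call = len(calls)
```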


(Sami) #13

True, I’ll change it to pself(variable) instead of the forward call!

The first one does yield a gradient.
The second yields None.
The third yields a gradient.
The fourth does as well.


(Justus Schock) #14

That’s strange. At least the first and third approaches, and likewise the second and fourth, should be equivalent.

How do you create self.X?


(Sami) #15

It is created in the forward pass

    def forward(self, input):
        # Input shape: minibatch x in_channels x iH x iW
        self.X = input
        return super().forward(input)

(Justus Schock) #16

Can you try it with just some other random tensors instead of self.X, to ensure that it is not an unfortunate operation on self.X which is causing that behavior?


(Sami) #17

Using

x = torch.randn((self.X.shape), requires_grad=True)

does indeed work. It must have something to do with self.X sharing memory with pself, so that when I use pself.forward, it overwrites the old self.X and the new one does not have a gradient…
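
One possible explanation, which the thread itself does not confirm, is that self.X is the output of an earlier layer and therefore a non-leaf tensor. By default PyTorch only populates `.grad` on leaf tensors after `backward()`, while `torch.autograd.grad` can return the gradient for any tensor in the graph; that would be consistent with the second snippet printing None and the fourth one yielding a gradient. The sketch below uses hypothetical tensors `leaf`, `hidden` and `out` to show this behavior and the effect of `retain_grad()`.

```python
import torch

leaf = torch.randn(3, requires_grad=True)

# Case 1: non-leaf tensor, no retain_grad(): .grad stays None after backward().
hidden = leaf * 2
out = hidden ** 2
out.backward(torch.ones_like(out))
grad_without_retain = hidden.grad  # None (PyTorch also emits a warning here)

# Case 2: ask autograd to keep the gradient of the non-leaf tensor.
hidden = leaf * 2
hidden.retain_grad()
out = hidden ** 2
out.backward(torch.ones_like(out))
grad_with_retain = hidden.grad

# Case 3: torch.autograd.grad returns the gradient directly, leaf or not.
hidden = leaf * 2
out = hidden ** 2
(grad_direct,) = torch.autograd.grad(out, hidden, torch.ones_like(out))
```

If this is indeed the cause, calling `self.X.retain_grad()` before the backward pass, or using `torch.autograd.grad`, would be two ways to obtain the gradient without copying the tensor.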


(Justus Schock) #18

OK, if you want, you can either create a gist or PM me your code so that I can have a look at your whole class, in case you don’t find the mistake yourself.