Yes, this part I understand. What I’m puzzled about is that running
pself = copy.copy(self)
pself.bias.data *= 0
pself.weight.data = torch.max(torch.DoubleTensor(1).zero_(), pself.weight)
Z = pself.forward(self.X) + 1e-9
S = torch.div(R, Z)
C = S.backward(torch.ones_like(S))
print(self.X.grad)
prints None for self.X.grad, while running
pself = copy.copy(self)
x = copy.copy(self.X)
pself.bias.data *= 0
pself.weight.data = torch.max(torch.DoubleTensor(1).zero_(), pself.weight)
Z = pself.forward(x) + 1e-9
S = torch.div(R, Z)
C = S.backward(torch.ones_like(S))
print(x.grad)
does print the same gradient for x that
C = torch.autograd.grad(S, self.X, torch.ones_like(S))
gives me.
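For reference, the agreement I’d expect looks like the sketch below, with a plain nn.Linear standing in for my actual layer (the layer, shapes, and R are just placeholders, not my real module). In this toy case X is a leaf tensor, so both routes agree and X.grad is populated, unlike self.X inside my module:

import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

# Placeholders standing in for my actual setup
layer = nn.Linear(4, 3).double()                               # plays the role of `self`
X = torch.randn(2, 4, dtype=torch.double, requires_grad=True)  # plays the role of `self.X`
R = torch.rand(2, 3, dtype=torch.double)                       # plays the role of `R`

pself = copy.copy(layer)
pself.bias.data = pself.bias.data * 0
pself.weight.data = torch.clamp(pself.weight.data, min=0)      # keep only the positive weights

Z = pself(X) + 1e-9
S = torch.div(R, Z)

# Gradient via autograd.grad (does not populate any .grad attribute)
grad_ref, = torch.autograd.grad(S, X, torch.ones_like(S), retain_graph=True)

# Same gradient via backward(); X.grad is populated here because X is a leaf
S.backward(torch.ones_like(S))
print(torch.allclose(grad_ref, X.grad))                        # True in this toy case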