Gradient with respect to input with multiple outputs

Yes, that part I understand. What puzzles me is that if I do

        pself = copy.copy(self)
        pself.bias.data *= 0
        pself.weight.data = torch.max(torch.DoubleTensor(1).zero_(), pself.weight)
        Z = pself.forward(self.X) + 1e-9
        S = torch.div(R, Z)
        C = S.backward(torch.ones_like(S))
        print(self.X.grad)

yields None, while

        pself = copy.copy(self)
        x = copy.copy(self.X)
        pself.bias.data *= 0
        pself.weight.data = torch.max(torch.DoubleTensor(1).zero_(), pself.weight)
        Z = pself.forward(x) + 1e-9
        S = torch.div(R, Z)
        C = S.backward(torch.ones_like(S))
        print(x.grad)

does yield the same gradient that

        C = torch.autograd.grad(S, self.X, torch.ones_like(S))

returns.
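In case it helps narrow this down, here is a minimal standalone sketch of the `.grad`-is-`None` behavior I am asking about, using made-up tensors rather than the layer above (the shapes and values here are illustrative assumptions, not taken from my actual model):

```python
import torch

# Minimal repro: .grad is only populated on leaf tensors that
# require grad; intermediate (non-leaf) tensors report None.
x = torch.ones(3, dtype=torch.double, requires_grad=True)  # leaf tensor
y = x * 2                                                  # non-leaf: result of an op
z = (y ** 2).sum()                                         # z = sum((2x)^2) = sum(4x^2)
z.backward()

print(x.grad)  # tensor([8., 8., 8.], dtype=torch.float64), since dz/dx = 8x
print(y.grad)  # None (non-leaf; recent PyTorch versions also emit a warning)
```

So my guess is that `self.X` somehow stops being treated as the leaf that `backward()` accumulates into, while the `copy.copy` version is, but I do not see why.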