Why the grad is changing

I am using a linear network; the structure of the network is as follows:


As you can see, this is a linear network with no activation function.
I think that when I call the backward function, the gradient of the input vector should be constant, but the result seems to change. The following image shows the grad for an all-zero input vector:

When I change the input vector, the result changes, as in the following image:

Is there something wrong?

The gradient is not constant; it depends on the input. The gradient here is the derivative of the loss w.r.t. the weights. When the input is all zeros, that derivative is zero everywhere (since d(0*W)/d(W) = 0); when the input is nonzero somewhere, the derivative is no longer zero everywhere.
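
For example, here is a minimal sketch (using a fresh nn.Linear, not your actual model) showing that the weight gradients vanish for an all-zero input but not for a nonzero one:

import torch
from torch import nn

net = nn.Linear(5, 10)

# All-zero input: every weight gradient is zero, because dL/dW_ij = x_j = 0.
x = torch.zeros(1, 5)
net(x).sum().backward()
print(net.weight.grad.abs().max())   # tensor(0.)

# Nonzero input: the weight gradients are no longer zero.
net.zero_grad()
x = torch.ones(1, 5)
net(x).sum().backward()
print(net.weight.grad.abs().max())   # tensor(1.)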

Thanks for your answer, but here I am taking the derivative of the loss w.r.t. the input. It is supposed to be d(x*W)/d(x) = W.

We have two parallel conversations going on for the same question, but basically I think that when you run output.backward(), it computes dy/dw, not dy/dx.
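
A small illustration of that distinction (the layer and shapes here are made up, not your code): output.backward() populates .grad on the leaf tensors that require gradients, which by default are only the parameters; the input only receives a gradient if you mark it with requires_grad_() yourself.

import torch
from torch import nn

net = nn.Linear(5, 10)
x = torch.ones(1, 5, requires_grad=True)  # opt in to getting dy/dx on the input

net(x).sum().backward()
print(net.weight.grad.shape)  # dy/dw, shape (10, 5): always computed for the parameters
print(x.grad)                 # dy/dx: only filled in because x requires grad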

So maybe my calculation method is wrong?

I tried this, and it seems to give the same answer when I change x, so I think this might be what you want:

import torch
from torch import nn

net = nn.Linear(5, 10)

Then, for two different x, we get the same dy/dx:

input_tensor = torch.zeros(1, 5)
input_tensor.requires_grad_()
torch.autograd.grad(net(input_tensor).sum(), input_tensor, create_graph=True)

(tensor([[-1.0625, -0.4604, 0.7039, 0.1229, -1.1627]], grad_fn=<MmBackward>),)
input_tensor = torch.ones(1, 5)
input_tensor.requires_grad_()
torch.autograd.grad(net(input_tensor).sum(), input_tensor, create_graph=True)

(tensor([[-1.0625, -0.4604, 0.7039, 0.1229, -1.1627]], grad_fn=<MmBackward>),)

Which exactly matches the column sums of the weight matrix:

net.weight.sum(dim=0)

tensor([-1.0625, -0.4604,  0.7039,  0.1229, -1.1627], grad_fn=<SumBackward1>)
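
For reference, a quick sketch of why those numbers have to match: the scalar being differentiated is net(input_tensor).sum() = sum_i (sum_j W_ij * x_j + b_i), so its derivative w.r.t. x_j is sum_i W_ij, i.e. the column sums of the weight matrix, which is exactly what net.weight.sum(dim=0) computes, independently of x.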

It turns out the issue was that the presence of nonlinear layers in the model (e.g. ReLU) means that dy/dx does depend on x.

Yes, I have found the reason. The network applies a normalization to its input and output, so it is no longer a linear network, and dy/dx therefore depends on x.
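
For anyone hitting the same thing later, here is a minimal sketch (a made-up model with a ReLU in it, not the actual network from the question) showing how a nonlinearity makes dy/dx depend on x:

import torch
from torch import nn

# Hypothetical model: linear -> ReLU -> linear, just to demonstrate the effect.
net = nn.Sequential(nn.Linear(5, 10), nn.ReLU(), nn.Linear(10, 3))

def dydx(x):
    x = x.clone().requires_grad_()
    return torch.autograd.grad(net(x).sum(), x)[0]

print(dydx(torch.zeros(1, 5)))  # gradient with the ReLU mask determined by x = 0
print(dydx(torch.ones(1, 5)))   # generally different: the active ReLU units change with x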