I am training a model whose forward pass returns the gradient of the final fully connected layer's output with respect to the input, for each element of the minibatch (i.e. the first element of the returned value is the gradient of the first value of `self.fc3(out).squeeze()` with respect to `x[0]`):

```
def forward(self, x):
    out = self.conv1(x)
    out = self.relu1(out)
    out = self.conv2(out)
    out = self.relu2(out).view(x.shape[0], -1)
    out = self.fc1(out)
    out = self.relu3(out)
    out = self.fc3(out).squeeze()
    # create_graph=True so the returned gradient is itself differentiable
    out = torch.autograd.grad(out, x, grad_outputs=torch.ones_like(out),
                              create_graph=True)[0]
    return out
```

Here, the input is of size (10, 1, 10000) and the output is of size (10, 1, 10000). This part seems to work correctly, but in the main loop I need to compute the gradient again in a similar way:

```
samples = samples.reshape(
    (samples.shape[0], 1, samples.shape[1] * samples.shape[2])
).float().requires_grad_(True)
outputs = model(samples)
out1 = (outputs @ v).squeeze()
out1 = torch.autograd.grad(out1, samples, grad_outputs=torch.ones_like(out1),
                           create_graph=True)[0].float()
```

`(outputs @ v).squeeze()` results in a 1-d tensor of size 10. Since I set `create_graph=True` in the forward method, I would expect to be able to call `autograd.grad` on it with respect to `samples` and get non-zero values. However, the resulting `out1` is a (10, 1, 10000) tensor of all zeros, rather than one filled with the correct gradients.
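For reference, this is the second-order behavior I would expect `create_graph=True` to enable — a minimal toy sketch with plain tensors (not my actual model):

```
import torch

x = torch.randn(3, requires_grad=True)
y = (x ** 3).sum()

# First gradient, kept in the graph so it is itself differentiable
g = torch.autograd.grad(y, x, create_graph=True)[0]  # g = 3 * x**2
print(g.requires_grad)  # True, because create_graph=True

# Second gradient: differentiating g.sum() w.r.t. x gives 6 * x, non-zero
h = torch.autograd.grad(g.sum(), x)[0]
print(torch.allclose(h, 6 * x))  # True
```

So the first-order gradient carries a `grad_fn` and differentiating it again yields the non-zero second derivative, which is what I expected to happen with `samples` above.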