The argument is not about them having different values, just that gradients really correspond to Tensors. So when you do a[0], you get a brand new Tensor.
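To make that concrete, here is a minimal sketch (variable names are my own) showing that indexing produces a distinct, non-leaf Tensor, so asking autograd for a gradient with respect to a slice taken after the fact fails:

import torch

a = torch.rand(3, 1, dtype=torch.float64, requires_grad=True)
v = a[0]                      # indexing returns a brand new Tensor
print(v is a, v.is_leaf)      # False False -- a separate, non-leaf node
print(v.grad_fn)              # e.g. <SelectBackward0 ...> -- built from a, not an input in its own right

output = (2 * a).sum()
# torch.autograd.grad(output, v)  # would raise: v was not used to compute output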
What I ended up coming up with for this task of taking gradients of a loss with respect to some views of my tensor was very hacky:
import numpy as np
import torch

a_np = np.random.random((3, 1))
# wrap each element in its own leaf tensor so gradients can be taken per element
a = [torch.autograd.Variable(torch.DoubleTensor(element), requires_grad=True) for element in a_np]
# stack back into a single tensor to give to functions that expect the full tensor
a_intermediate = torch.stack(a)
output = (2 * a_intermediate).sum()
# gradients with respect to any subset of the per-element tensors
print(torch.autograd.grad(output, a[0:]))
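This works because each element of a is its own leaf tensor with requires_grad=True, and torch.stack records the stacking in the autograd graph, so gradients flow back to every per-element tensor individually.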
But I had wanted to do
a_np = np.random.random((3, 1))
a = torch.autograd.Variable(torch.DoubleTensor(a_np), requires_grad=True)
output = (2 * a).sum()
print(torch.autograd.grad(output, a[:]))  # fails: a[:] is a new Tensor that was not used to compute output
This is for the case where I wanted to compute the gradient of the output with respect to a fully contiguous slice of its tensor inputs.
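What does work, as far as I can tell, is asking for the gradient with respect to the full leaf tensor and then slicing the resulting gradient; a minimal sketch (names are my own):

import numpy as np
import torch

a_np = np.random.random((3, 1))
a = torch.tensor(a_np, dtype=torch.float64, requires_grad=True)
output = (2 * a).sum()
grad_a, = torch.autograd.grad(output, a)  # gradient w.r.t. the whole leaf tensor
print(grad_a[0:2])                        # slicing the gradient stands in for "grad w.r.t. a[0:2]"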
On the view discussion it seems
is really the key disconnect. I was reading this as a much stronger statement than it is, as if a view shared everything with the original tensor, including its autograd graph connections and not just its memory. Instead it appears it is purely the underlying data/values that are shared.
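A quick way to see the distinction (my own illustration): a view shares storage with the original tensor, but it is a separate Tensor node as far as autograd is concerned:

import torch

a = torch.rand(3, 1, dtype=torch.float64, requires_grad=True)
view = a[0:2]
print(view.data_ptr() == a.data_ptr())  # True -- same underlying storage
print(view.is_leaf, view.grad_fn)       # False, <SliceBackward0 ...> -- a distinct node built from a

output = (2 * a).sum()
# torch.autograd.grad(output, view)  # would raise: view was not used to compute output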
Update: I believe this post, along with https://pytorch.org/functorch/stable/notebooks/jacobians_hessians.html, will be beneficial to others for understanding how to calculate Jacobians.
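For anyone landing here later, a minimal sketch of that approach, assuming a PyTorch build that exposes jacrev (torch.func in PyTorch 2.x, functorch in earlier releases):

import torch
from torch.func import jacrev  # use `from functorch import jacrev` on older versions

def f(x):
    return (2 * x).sum()

x = torch.rand(3, 1, dtype=torch.float64)
jac = jacrev(f)(x)   # Jacobian of the scalar output w.r.t. every element of x
print(jac)           # shape (3, 1), every entry 2.0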