I have a following situation. An input has shape [BATCH_SIZE, DIMENSIONALITY] and an output has shape [BATCH_SIZE, CLASSES]. The gradient of the output with respect to the input should have shape [BATCH_SIZE, CLASSES, DIMENSIONALITY] (or similar). Currently, I use a following function to get the gradient:

def compute_grad(input, model):
input.requires_grad_(True)
grad = []
output = model(input)
for i in range(output.size(1)):
grad.append(
torch.autograd.grad(
outputs=output[:, i],
inputs=input,
grad_outputs=torch.ones(len(input)),
retain_graph=True,
create_graph=True
)[0]
)
input.requires_grad_(False)
return torch.stack(grad)

Is there a way to get the gradient without using for loop? Could this function be written better (more elegant and efficient)?

Currently you are running the backward once for each example, but this can be parallelized with large batches on the GPU.
If you run the forward and then backwards on your model, i.e.,

input.requires_grad_(True)
out = model.forward(input)
optimizer.zero_grad()
loss = criterion(out,class_labels)
loss.backward()

Of course, you should pass the model parameters first to the optimizer, and criteria is cross entropy. And then you can simply extract the grads from the model:

grad=[]
for param in model.parameters():
grad.stack(param.grad.data)

Autograd tools can only do Jacobian Vector products. So if you want to get a full Jacobian, you will need to run as many backward as the ouput size.
Unfortunately there is no way to make this more efficient in the general case at the moment.

If the autograd tools can only do Jacobian Vector products, then, in my opinion, itâ€™s quite confusing that you are able to specify a matrix with shape [n,m] for the grad_outputs parameter when the output is a matrix. For example, in the above scenario if I do

grad = torch.autograd.grad(
outputs=output,
inputs=input,
grad_outputs=torch.ones(output.size()),
retain_graph=True,
create_graph=True
)[0]

then the grad has now shape [BATCH_SIZE, DIMENSIONALITY].

The thing is that for the Jacobian, your output is always considered as 1D (similar with the input).
That way you can see the Jacobian as a big 2D matrix.
And the vector you need to give for the dot product should be as big as the output itself.

There are two things here.
You net takes an ND tensor as input and outputs an ND tensor.
Pytorch asks an ND tensor for the grad_output and will return an ND tensor for the grad_inputs.

Now if you look at this mathematically, if you study the function that is defined by your network, you usually consider it takes a 1D tensor and outputs a 1D tensor.
This functionâ€™s jacobian will be a 2D tensor of size nb_in x nb_out.
When I say the pytorch performs Jacobian vector product. It is based on this mathematical formulation where the jacobian is a 2D tensor and the vector is a vector of size nb_out.

That being said, these mathematical objects are never actually created and pytorch works only with the ND tensors you give him. But it will return the result corresponding to these mathematical operations (though not in 1D but with the ND that you gave it initially).

Yeah, I understand that. I am asking all these questions, because I want to understand mathematically what torch.autograd.grad(y, x, v) returns if y is N tensor, x is ND tensor and v is N tensor.

So if we index y and grad_y with i and x and grad_x with j and k (assuming x is 2D but you can add more dimensions if you want).
Doing grad_x = torch.autograd.grad(y, x, grad_y).
Then grad_x[j, k] = sum_i (grad_y[i] * dy[i]/dx[j,k]).