How to compute derivative of matrix output with respect to matrix input most efficiently?

schizoburger · September 2, 2018, 6:28pm

I have a following situation. An input has shape [BATCH_SIZE, DIMENSIONALITY] and an output has shape [BATCH_SIZE, CLASSES]. The gradient of the output with respect to the input should have shape [BATCH_SIZE, CLASSES, DIMENSIONALITY] (or similar). Currently, I use a following function to get the gradient:

def compute_grad(input, model):
    input.requires_grad_(True)
    grad = []
    output = model(input)
    for i in range(output.size(1)):
        grad.append(
            torch.autograd.grad(
                outputs=output[:, i],
                inputs=input,
                grad_outputs=torch.ones(len(input)),
                retain_graph=True, 
                create_graph=True
            )[0]
        )
    input.requires_grad_(False)
    return torch.stack(grad)

Is there a way to get the gradient without using for loop? Could this function be written better (more elegant and efficient)?

Ranahanocka · September 2, 2018, 9:19pm

Currently you are running the backward once for each example, but this can be parallelized with large batches on the GPU.
If you run the forward and then backwards on your model, i.e.,

input.requires_grad_(True)
out = model.forward(input)
optimizer.zero_grad()
loss = criterion(out,class_labels)
loss.backward()

Of course, you should pass the model parameters first to the optimizer, and criteria is cross entropy. And then you can simply extract the grads from the model:

grad=[]
for param in model.parameters():
        grad.stack(param.grad.data)

albanD · September 3, 2018, 1:38pm

Hi,

Autograd tools can only do Jacobian Vector products. So if you want to get a full Jacobian, you will need to run as many backward as the ouput size.
Unfortunately there is no way to make this more efficient in the general case at the moment.

schizoburger · September 3, 2018, 2:10pm

Thank you for the answer.

If the autograd tools can only do Jacobian Vector products, then, in my opinion, it’s quite confusing that you are able to specify a matrix with shape [n,m] for the grad_outputs parameter when the output is a matrix. For example, in the above scenario if I do

grad = torch.autograd.grad(
                outputs=output,
                inputs=input,
                grad_outputs=torch.ones(output.size()),
                retain_graph=True, 
                create_graph=True
            )[0]

then the grad has now shape [BATCH_SIZE, DIMENSIONALITY].

albanD · September 5, 2018, 7:31am

The thing is that for the Jacobian, your output is always considered as 1D (similar with the input).
That way you can see the Jacobian as a big 2D matrix.
And the vector you need to give for the dot product should be as big as the output itself.

schizoburger · September 27, 2018, 12:48pm

Does it mean that the input and the output are flattened?

albanD · September 27, 2018, 12:53pm

No they are not. This is just the reason why the shape of the gra_outputs doesn’t matter as long as it is the same as the output.

schizoburger · September 27, 2018, 1:11pm

Okay, but you say that the input is considered as 1D even though in my case the input is 2D. How is that resolved?

albanD · September 27, 2018, 1:23pm

There are two things here.
You net takes an ND tensor as input and outputs an ND tensor.
Pytorch asks an ND tensor for the grad_output and will return an ND tensor for the grad_inputs.

Now if you look at this mathematically, if you study the function that is defined by your network, you usually consider it takes a 1D tensor and outputs a 1D tensor.
This function’s jacobian will be a 2D tensor of size nb_in x nb_out.
When I say the pytorch performs Jacobian vector product. It is based on this mathematical formulation where the jacobian is a 2D tensor and the vector is a vector of size nb_out.

That being said, these mathematical objects are never actually created and pytorch works only with the ND tensors you give him. But it will return the result corresponding to these mathematical operations (though not in 1D but with the ND that you gave it initially).

Is that a clearer explanation?

schizoburger · September 27, 2018, 2:21pm

Yeah, I understand that. I am asking all these questions, because I want to understand mathematically what torch.autograd.grad(y, x, v) returns if y is N tensor, x is ND tensor and v is N tensor.

albanD · September 27, 2018, 4:08pm

So if we index y and grad_y with i and x and grad_x with j and k (assuming x is 2D but you can add more dimensions if you want).
Doing grad_x = torch.autograd.grad(y, x, grad_y).
Then grad_x[j, k] = sum_i (grad_y[i] * dy[i]/dx[j,k]).

schizoburger · September 27, 2018, 5:52pm

Thank you very much!