Hello, I am trying to compute the gradient of the quadratic expression `x.T@A@x`, where `x` is a matrix with `n` samples and `m` features and `A` is an `m` x `m` matrix. The manual derivative of this expression with respect to `x` is `x@(A + A.T)`. I tried to get this derivative using the autograd machinery, but I couldn't. I implemented it this way:

```
import torch

n_samples = 5
n_features = 2
x = torch.randint(0, 5, (n_samples, n_features)).to(torch.float32).requires_grad_(True)
A = torch.randint(0, 5, (n_features, n_features)).to(torch.float32)

loss = (x @ A @ x.T).sum()
loss.backward()
print(x.grad)

grad2 = x @ (A + A.T)
print(grad2)
```

However, the two results are different: each row of `grad` (that is, `x.grad`) equals the column-wise sum of `grad2`. For example:

```
grad = tensor([[96., 78.],
        [96., 78.],
        [96., 78.],
        [96., 78.],
        [96., 78.]])

grad2 = tensor([[16., 14.],
        [20., 18.],
        [28., 22.],
        ...])
```

I understand that this behavior may be due to the fact that I am using the `.sum()` operator before calling `backward()`, but I would like to know whether I can get the same result as in the manual derivation. I appreciate any help and further explanation. Thanks in advance.

Hi Xavier!

It’s not entirely clear what you are trying to compute here. (Note that
in the text of your question, `x.T@A@x`, you have the transpose in the
wrong place, but you have it correct in your example code.)

Correcting the transpose, your expression `x @ A @ x.T` has shape
`[n_samples, n_samples]`, so it’s unclear what you want to differentiate
with respect to `x`.

Your formula `x @ (A + A.T)` gives you a batch (of length `n_samples`)
of gradients (each of which has length `n_features`) of scalar quadratic
forms using `A`. So I
speculate that you want to compute `x[i] @ A @ x[i]` on a batch basis
and then compute the gradients of these scalars with respect to `x[i]` on
a batch basis.
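For concreteness, here is a sketch of that batch of scalar quadratic forms written with `einsum` (the shapes, seed, and variable names are just for illustration), together with a check that these scalars are exactly the diagonal of the full `x @ A @ x.T` product:

```python
import torch

_ = torch.manual_seed (1066)

x = torch.randint (0, 5, (5, 2)).to (torch.float32)
A = torch.randint (0, 5, (2, 2)).to (torch.float32)

# q[i] = x[i] @ A @ x[i]  -- no cross terms between samples
q = torch.einsum ('bi,ij,bj->b', x, A, x)

# the same scalars sit on the diagonal of the full matrix product
print (torch.allclose (q, (x @ A @ x.T).diagonal()))   # True
```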

You can use autograd to do what (I think) you want, but you have to
somehow select out the individual batch elements that you want.

Consider this tweaked version of your example code:

```
import torch
print (torch.__version__)

_ = torch.manual_seed (1066)

n_samples = 5
n_features = 2
x = torch.randint (0, 5, (n_samples, n_features)).to (torch.float32).requires_grad_ (True)
A = torch.randint (0, 5, (n_features, n_features)).to (torch.float32)

loss = (x @ A @ x.T).sum()              # this includes  x[i] @ A @ x[j]  cross terms
loss.backward()
print (x.grad)

x.grad = None                           # reset the accumulated gradient
loss = (x @ A @ x.T).diagonal().sum()   # keep just the  x[i] @ A @ x[i]  terms
loss.backward()
print (x.grad)

grad2 = x.detach() @ (A + A.T)          # batch of gradients for a batch of scalar quadratic forms
print (grad2)

x.grad = None                           # reset again before the explicit loop
for  i, xi in enumerate (x):            # compute batch of gradients with explicit loop
    (xi @ A @ xi).backward()            # each row of x only gets a gradient from its own term
    print (x.grad[i])
```

Here is its output:

```
2.1.0
tensor([[68., 74.],
        [68., 74.],
        [68., 74.],
        [68., 74.],
        [68., 74.]])
tensor([[10., 20.],
        [18., 14.],
        [ 6.,  8.],
        [14., 12.],
        [20., 20.]])
tensor([[10., 20.],
        [18., 14.],
        [ 6.,  8.],
        [14., 12.],
        [20., 20.]])
tensor([10., 20.])
tensor([18., 14.])
tensor([6., 8.])
tensor([14., 12.])
tensor([20., 20.])
```

This shows how you can use a loop to compute the batch of gradients
and also how you can do it using a single, loop-free call to `.backward()`.
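(As a further aside — assuming a recent PyTorch that ships `torch.func` — you can also get the per-sample gradients without a loop and without ever forming the `n_samples x n_samples` matrix, by `vmap`-ing `grad` over the batch. This is a sketch of that alternative, not part of your original approach:)

```python
import torch
from torch.func import grad, vmap

_ = torch.manual_seed (1066)

x = torch.randint (0, 5, (5, 2)).to (torch.float32)
A = torch.randint (0, 5, (2, 2)).to (torch.float32)

def quad (xi):                       # scalar quadratic form for one sample
    return xi @ A @ xi

grads = vmap (grad (quad)) (x)       # one gradient row per sample

print (torch.allclose (grads, x @ (A + A.T)))   # True
```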

Yes, the `.sum()` is your problem (based on my assumption that you want
to focus on the individual scalar quadratic forms, `x[i] @ A @ x[i]`). In your
expression, `.sum()` is also summing over the cross terms `x[i] @ A @ x[j]`.
There are various ways to deal with this, one being to use `.diagonal()` to
pluck out just the terms you want. The loop version in my example script
verifies that `.diagonal().sum()` does, in fact, give you what (I think) you
want.
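To illustrate one other way (a sketch along the same lines, with the shapes and seed chosen for illustration): you can build only the per-sample scalars with `einsum`, so the cross terms are never formed at all, and check that autograd then reproduces your manual derivation:

```python
import torch

_ = torch.manual_seed (1066)

x = torch.randint (0, 5, (5, 2)).to (torch.float32).requires_grad_ (True)
A = torch.randint (0, 5, (2, 2)).to (torch.float32)

# only the  x[i] @ A @ x[i]  terms -- the cross terms are never formed
loss = torch.einsum ('bi,ij,bj->b', x, A, x).sum()
loss.backward()

print (torch.allclose (x.grad, x.detach() @ (A + A.T)))   # True
```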

Best.

K. Frank


That is exactly what I wanted to do, thank you so much!