I have a question about the shape of the gradient produced by backward() in a very simple case. Suppose we have a 2-dimensional vector z whose inner product with the all-ones vector forms the loss l, so l = z^T 1 = 1^T z. The usual convention seems to be that dl/dz is a row vector (cf. https://mathinsight.org/derivative_matrix). However, the shape of the gradient produced by backward() depends on whether z is a column vector or a row vector. Is there some deeper reason behind this design?

```
import torch
from torch.autograd import Variable

dtype = torch.FloatTensor

# z is a column vector, shape (2, 1)
z = Variable(torch.ones(2, 1).type(dtype), requires_grad=True)
l = Variable(torch.ones(2, 1).t().type(dtype)).mm(z)  # l = 1^T z
l.backward()
z.grad  # shape (2, 1), same as z

# z is a row vector, shape (1, 2)
z = Variable(torch.ones(2, 1).t().type(dtype), requires_grad=True)
l = z.mm(Variable(torch.ones(2, 1).type(dtype)))  # l = z 1
l.backward()
z.grad  # shape (1, 2), same as z
```
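For what it's worth, the same behavior is easy to reproduce without the (now deprecated) Variable wrapper, using plain tensors with requires_grad=True (a sketch, assuming a recent PyTorch version):

```python
import torch

# z as a column vector, shape (2, 1)
z = torch.ones(2, 1, requires_grad=True)
l = torch.ones(1, 2) @ z          # l = 1^T z, shape (1, 1)
l.backward()
print(z.grad.shape)               # (2, 1) -- matches z's shape, not the row-vector convention

# z as a row vector, shape (1, 2)
z = torch.ones(1, 2, requires_grad=True)
l = z @ torch.ones(2, 1)          # l = z 1, shape (1, 1)
l.backward()
print(z.grad.shape)               # (1, 2) -- again matches z's shape
```

In both cases z.grad has exactly the same shape as z itself.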

The reason I ask is that this has implications for the shape of the grad_variables argument to supply to backward() when there is a chain of operations. For example, suppose we have the following equations:

x = [[5, 1], [1, 5]] z

l = 1^T x

Then dl/dz should be [6 6].

To compute this gradient via the chain rule, we should use dl/dz = dl/dx dx/dz, where dx/dz = [[5, 1], [1, 5]]. For that matrix product to be well-defined, it seems natural for dl/dx to be the row vector [1 1].

However, to compute this gradient using x.backward(grad_variables=vec), we need vec to be the column vector [1; 1], not the row vector [1 1].
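To make the mismatch concrete, here is a sketch of the chain example in current PyTorch, where the grad_variables keyword has since been renamed to gradient in Tensor.backward():

```python
import torch

A = torch.tensor([[5., 1.], [1., 5.]])

z = torch.ones(2, 1, requires_grad=True)
x = A @ z                       # x = [[5, 1], [1, 5]] z, shape (2, 1)

# The seed vector must have x's shape (2, 1), i.e. the column vector [1; 1],
# even though the convention would write dl/dx as the row vector [1 1].
x.backward(torch.ones(2, 1))

print(z.grad)                   # tensor([[6.], [6.]])
```

Internally backward() computes A^T v for the seed v (a vector-Jacobian product), so [[5, 1], [1, 5]]^T [1; 1] = [6; 6], which matches dl/dz = [6 6] up to the transpose.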

Thanks!