I have a question about the shape of the gradient produced by backward() in a very simple case. Suppose we have a 2-dimensional vector z whose inner product with the all-ones vector forms the loss l, so l = z^T 1 = 1^T z. The usual convention seems to be that dl/dz is a row vector (cf. https://mathinsight.org/derivative_matrix). However, the shape of the gradient produced by backward() depends on whether z is a column vector or a row vector. Is there some deeper reason behind this design?

```
import torch
from torch.autograd import Variable

dtype = torch.FloatTensor

# z is a column vector, shape (2, 1)
z = Variable(torch.ones(2, 1).type(dtype), requires_grad=True)
l = Variable(torch.ones(2, 1).t().type(dtype)).mm(z)  # l = 1^T z
l.backward()
z.grad  # shape (2, 1), same as z

# z is a row vector, shape (1, 2)
z = Variable(torch.ones(2, 1).t().type(dtype), requires_grad=True)
l = z.mm(Variable(torch.ones(2, 1).type(dtype)))  # l = z 1
l.backward()
z.grad  # shape (1, 2), same as z
```
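For what it's worth, the same behavior is easy to reproduce without the (now deprecated) Variable wrapper, using plain tensors with requires_grad=True (a sketch, assuming a recent PyTorch version):

```python
import torch

# z as a column vector, shape (2, 1)
z = torch.ones(2, 1, requires_grad=True)
l = torch.ones(1, 2) @ z          # l = 1^T z, shape (1, 1)
l.backward()
print(z.grad.shape)               # (2, 1) -- matches z's shape, not the row-vector convention

# z as a row vector, shape (1, 2)
z = torch.ones(1, 2, requires_grad=True)
l = z @ torch.ones(2, 1)          # l = z 1, shape (1, 1)
l.backward()
print(z.grad.shape)               # (1, 2) -- again matches z's shape
```

In both cases z.grad has exactly the same shape as z itself.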

The reason I ask is that this has implications for the shape of the grad_variables argument to supply to backward() when there is a chain of operations. For example, suppose we have the following equations:

x = [[5, 1], [1, 5]] z

l = 1^T x

Then dl/dz should be [6 6].

To compute this gradient via the chain rule, we should use dl/dz = dl/dx dx/dz, where dx/dz = [[5, 1], [1, 5]]. For that matrix product to be well-defined, it seems natural for dl/dx to be the row vector [1 1].

However, to compute this gradient using x.backward(grad_variables=vec), we need vec to be the column vector [1; 1], not the row vector [1 1].
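To make the mismatch concrete, here is a sketch of the chain example in current PyTorch, where the grad_variables keyword has since been renamed to gradient in Tensor.backward():

```python
import torch

A = torch.tensor([[5., 1.], [1., 5.]])

z = torch.ones(2, 1, requires_grad=True)
x = A @ z                       # x = [[5, 1], [1, 5]] z, shape (2, 1)

# The seed vector must have x's shape (2, 1), i.e. the column vector [1; 1],
# even though the convention would write dl/dx as the row vector [1 1].
x.backward(torch.ones(2, 1))

print(z.grad)                   # tensor([[6.], [6.]])
```

Internally backward() computes A^T v for the seed v (a vector-Jacobian product), so [[5, 1], [1, 5]]^T [1; 1] = [6; 6], which matches dl/dz = [6 6] up to the transpose.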

Thanks!