I was looking at the code for torch.nn.Linear(in_features, out_features, bias=True)
and it seems that it stores the weight matrix one way but then transposes it at compute time, even though the transpose looks like it could have been avoided. Why does it store a matrix with dimensions (code: http://pytorch.org/docs/master/_modules/torch/nn/modules/linear.html#Linear):
self.weight = Parameter(torch.Tensor(out_features, in_features))
only to then go ahead and compute the linear transform as follows:
def forward(self, input):
    return F.linear(input, self.weight, self.bias)
which points to (http://pytorch.org/docs/master/_modules/torch/nn/functional.html#linear):
output = input.matmul(weight.t())
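For concreteness, here is a quick sanity check (the sizes below are made up) that the forward pass really is input @ weight.t() + bias, with weight stored as (out_features, in_features):

import torch
import torch.nn.functional as F

x = torch.randn(4, 3)          # batch of 4 row vectors, in_features = 3
layer = torch.nn.Linear(3, 5)  # in_features = 3, out_features = 5

out_functional = F.linear(x, layer.weight, layer.bias)
out_manual = x.matmul(layer.weight.t()) + layer.bias

print(torch.allclose(out_functional, out_manual))  # True
print(out_functional.shape)                        # torch.Size([4, 5])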
can’t this just be avoided by storing the dimensions in the order they were given and then doing the matrix multiply without the transpose?
def __init__(self, in_features, out_features, bias=True):
    ....
    self.weight = Parameter(torch.Tensor(in_features, out_features))
then just do:
input.matmul(weight)
and we avoid having to move the data around? Maybe the transpose doesn’t actually shuffle anything in memory and the data stays put, but it just seems really unnecessary.
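A quick check suggests that speculation is right: .t() just returns a view over the same storage with the strides swapped, so no data moves either way. A small sketch (W_alt is just a hypothetical name for the proposed (in_features, out_features) layout):

import torch

x = torch.randn(4, 3)          # batch of row vectors, in_features = 3
W = torch.randn(5, 3)          # current layout: (out_features, in_features)
Wt = W.t()                     # the transpose used inside F.linear

print(W.data_ptr() == Wt.data_ptr())  # True: same underlying memory, nothing is copied
print(W.stride(), Wt.stride())        # (3, 1) vs (1, 3): only the strides are swapped

W_alt = W.t().contiguous()     # hypothetical proposed layout: (in_features, out_features)
print(torch.allclose(x.matmul(Wt), x.matmul(W_alt)))  # True: same result either way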
Besides, even if it weren’t inefficient, it’s really weird to me that the data is represented as row vectors (i.e. one row is a data point, so the rows live in the original D_in-dimensional input space), but the weight matrix is stored as D_out x D_in, which reads as if D_out, the target dimension we land in, comes first. It seems odd to start thinking in terms of row vectors and then switch to a column-vector convention for the weights. Why was this done?
Plus, when one does linear.weight
it was surprising to me to discover that the shape of the parameters is switched from what I initially wrote when I created my linear layer. Maybe it’s just me, but it seems super odd and confusing.
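For example (just to illustrate the shape swap; the sizes are made up):

import torch

layer = torch.nn.Linear(in_features=3, out_features=5)

print(layer.weight.shape)  # torch.Size([5, 3]) -- i.e. (out_features, in_features), not (3, 5)
print(layer.bias.shape)    # torch.Size([5])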