How is the gradient in a Linear layer computed using a transpose?

I’m wondering why I need to transpose grad_output when computing grad_weight.

  • The linear layer applies a linear transformation to the incoming data: y = xA^T + b
  • What I’m wondering is how the gradient of the weight matrix is computed by code like the following.
    If I define a backward function, I would write it like this:
def backward(ctx, grad_output):
    input, weight, bias = ctx.saved_tensors

    grad_input = grad_output.mm(weight)      # gradient w.r.t. the input
    grad_weight = grad_output.t().mm(input)  # gradient w.r.t. the weight -- why the transpose?
    grad_bias = grad_output.sum(0)           # gradient w.r.t. the bias
    return grad_input, grad_weight, grad_bias

weight, bias, and input are saved during the forward pass and retrieved via ctx.saved_tensors for reuse in backward.
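
For reference, here is a minimal, self-contained sketch of the same idea as a full torch.autograd.Function (the class name MyLinear and the test shapes are just for illustration), checked against autograd with gradcheck:

import torch

class MyLinear(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input, weight, bias):
        # Cache the tensors that backward will need.
        ctx.save_for_backward(input, weight, bias)
        return input.mm(weight.t()) + bias   # y = xA^T + b

    @staticmethod
    def backward(ctx, grad_output):
        input, weight, bias = ctx.saved_tensors
        grad_input = grad_output.mm(weight)       # shape (batch, in_features)
        grad_weight = grad_output.t().mm(input)   # shape (out_features, in_features)
        grad_bias = grad_output.sum(0)            # shape (out_features,)
        return grad_input, grad_weight, grad_bias

# Numerically compare the hand-written backward against autograd.
x = torch.randn(4, 3, dtype=torch.double, requires_grad=True)
w = torch.randn(2, 3, dtype=torch.double, requires_grad=True)
b = torch.randn(2, dtype=torch.double, requires_grad=True)
print(torch.autograd.gradcheck(MyLinear.apply, (x, w, b)))  # prints True if the gradients are correct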

  • I know the transpose makes the shapes line up correctly for the matrix multiply,
  • but is there a mathematical reason behind the transpose in the code?



If you write down the expression for each entry of the gradient matrix, you will see that the indices are flipped relative to the weight matrix. That is why, in code, you need to add a transpose.
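
To spell it out with the y = xA^T + b convention from the question (writing G for grad_output and X for input):

y_ij = sum_k x_ik * A_jk

dL/dA_jk = sum_i (dL/dy_ij) * (dy_ij/dA_jk) = sum_i G_ij * x_ik = (G^T X)_jk

The row index j of A pairs with the column index of G, so the indices are flipped, and in matrix form that is exactly grad_output.t().mm(input).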

Thanks a lot, I will give it a try.