### Why do I need to transpose grad_output when calculating grad_weight?

- A linear layer applies a linear transformation to the incoming data: y = xA^T + b
- What I'm wondering is how the gradient of the weight matrix is computed by code like the following:

If I define a backward function, I would write it like this:

```
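def backward(ctx, grad_output):
    # Tensors cached during the forward pass
    input, weight, bias = ctx.saved_tensors
    grad_input = grad_output.mm(weight)      # (N, out) @ (out, in) -> (N, in)
    grad_weight = grad_output.t().mm(input)  # (out, N) @ (N, in)   -> (out, in)
    grad_bias = grad_output.sum(0)           # sum over the batch   -> (out,)
    return grad_input, grad_weight, grad_bias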
```

weight, bias and input are cached during the forward pass and retrieved through ctx.saved_tensors.
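To make the question concrete, here is a minimal sketch that wraps the backward above in a `torch.autograd.Function` and compares its gradients with the ones autograd computes for the built-in `torch.nn.functional.linear`. The class name `MyLinear`, the tensor sizes, and the random upstream gradient `g` are assumptions chosen for illustration; they are not part of the original code.

```python
import torch

class MyLinear(torch.autograd.Function):
    """Hypothetical wrapper around the backward shown above."""

    @staticmethod
    def forward(ctx, input, weight, bias):
        # Cache the tensors that backward will need
        ctx.save_for_backward(input, weight, bias)
        return input.mm(weight.t()) + bias  # y = xA^T + b

    @staticmethod
    def backward(ctx, grad_output):
        input, weight, bias = ctx.saved_tensors
        grad_input = grad_output.mm(weight)
        grad_weight = grad_output.t().mm(input)
        grad_bias = grad_output.sum(0)
        return grad_input, grad_weight, grad_bias

torch.manual_seed(0)
# Assumed sizes: batch of 4, 3 input features, 2 output features
x = torch.randn(4, 3, dtype=torch.double, requires_grad=True)
w = torch.randn(2, 3, dtype=torch.double, requires_grad=True)
b = torch.randn(2, dtype=torch.double, requires_grad=True)
g = torch.randn(4, 2, dtype=torch.double)  # arbitrary upstream gradient

# Gradients from the hand-written backward
MyLinear.apply(x, w, b).backward(g)
manual = [t.grad.clone() for t in (x, w, b)]

# Gradients from autograd on the built-in linear op
for t in (x, w, b):
    t.grad = None
torch.nn.functional.linear(x, w, b).backward(g)
reference = [t.grad.clone() for t in (x, w, b)]

grads_match = all(torch.allclose(m, r) for m, r in zip(manual, reference))
print(grads_match)
```

If the two sets of gradients agree, the transpose is doing more than fixing shapes: it reproduces what autograd derives for the same forward computation.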

- I know this works, because only with the transpose do the shapes line up for the matrix multiplication (torch.mm).
- But is there a mathematical reason for the transpose?
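One way to check whether the transpose is more than a shape fix is to write the layer element by element. The sketch below assumes a scalar loss $L$, a batch index $n$, an output index $j$ and an input index $k$, and writes $G$ for grad_output, so $G_{nj} = \partial L / \partial y_{nj}$. None of these symbols appear in the code above; they are introduced only for this derivation.

$$
y_{nj} = \sum_{k} x_{nk} A_{jk} + b_j
$$

$$
\frac{\partial L}{\partial A_{jk}}
= \sum_{n} \frac{\partial L}{\partial y_{nj}} \frac{\partial y_{nj}}{\partial A_{jk}}
= \sum_{n} G_{nj}\, x_{nk}
= \sum_{n} (G^T)_{jn}\, x_{nk}
= (G^T x)_{jk}
$$

$$
\frac{\partial L}{\partial x_{nk}} = \sum_{j} G_{nj} A_{jk} = (G A)_{nk},
\qquad
\frac{\partial L}{\partial b_j} = \sum_{n} G_{nj}
$$

Under these assumptions, the sum in the weight gradient runs over the batch index $n$, which is the first index of both $G$ and $x$. Expressing that sum as a matrix product requires $n$ to be the inner dimension, and that is what grad_output.t().mm(input) does.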

THANKS!