Why does the Linear module seem to do unnecessary transposing?

Adding to smth’s response: storing the second matrix of a matrix multiplication in transposed form may even increase efficiency. The multiplication routine can then access memory more contiguously, which leads to fewer cache misses. See, e.g.,
https://stackoverflow.com/questions/18796801/increasing-the-data-locality-in-matrix-multiplication
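To make the access pattern concrete, here is a rough Python/NumPy sketch (the function names are just for illustration, not anything from PyTorch). In pure Python the interpreter overhead hides the actual cache effect, so you would need a compiled language to measure the speedup, but the difference in memory access is the same:

```python
import numpy as np

def matmul_naive(A, B):
    """Naive C = A @ B. The inner loop walks down a column of B,
    which is a large-stride (non-contiguous) access for row-major storage."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += A[i, p] * B[p, j]   # column access into B: poor locality
            C[i, j] = s
    return C

def matmul_naive_bt(A, Bt):
    """Same product, but the second operand is stored transposed (Bt = B.T).
    Both inner-loop accesses now run along contiguous rows (unit stride)."""
    n, k = A.shape
    m, k2 = Bt.shape
    assert k == k2
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += A[i, p] * Bt[j, p]  # row access into Bt: good locality
            C[i, j] = s
    return C
```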

Storing the second matrix in transposed form can easily lead to a ~5x speedup in a naive matrix multiplication implementation. The effect will be much smaller in PyTorch, because the underlying matrix multiplication routine is certainly more clever than the one in the link above.
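This is essentially what nn.Linear does: the weight is stored with shape (out_features, in_features), and the forward pass multiplies by its transpose. A quick check:

```python
import torch
import torch.nn as nn

lin = nn.Linear(in_features=4, out_features=3, bias=False)
print(lin.weight.shape)  # torch.Size([3, 4]), i.e. (out_features, in_features)

x = torch.randn(2, 4)
# The forward pass is equivalent to multiplying by the transpose of the stored weight.
assert torch.allclose(lin(x), x @ lin.weight.t())
```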
