I was also thinking about this, and found this issue:
From what I understand, transposing in the forward pass has no overhead. But the backward pass will be less efficient if
input.matmul(weight)
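
For what it's worth, here is a sketch of how the gradients work out for the two layouts. The notation is an assumption for illustration and does not come from the linked issue: $x$ is the input with shape (batch × in), $W$ is the weight, and $g = \partial L / \partial y$ is the upstream gradient. It only shows where the transposes land in the backward pass, not which variant is faster.

```latex
% Layout A: weight stored as (out x in); the forward pass uses its transpose
y = x W^{\top}, \qquad
\frac{\partial L}{\partial x} = g\,W, \qquad
\frac{\partial L}{\partial W} = g^{\top} x

% Layout B: weight stored as (in x out); the forward pass is input.matmul(weight)
y = x W, \qquad
\frac{\partial L}{\partial x} = g\,W^{\top}, \qquad
\frac{\partial L}{\partial W} = x^{\top} g
```

In both layouts a transposed operand appears somewhere in the backward pass. Whether that costs anything in practice depends on the memory layout and the matmul backend, which this sketch does not measure.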