# nn.Linear weight shape initialization confusion

I have an input tensor with shape `(32, 5)`. To create an `nn.Linear` layer from the input to the hidden layer (let's say we want our hidden layer to have 64 neurons), we would use `linear1 = nn.Linear(5, 64)`. However, when I print `linear1.weight.shape`, it prints `torch.Size([64, 5])`. Does this mean that during forward propagation, PyTorch computes `Z = matmul(X, W.T)`?
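For concreteness, here is a minimal check (assuming PyTorch is installed) that the weight is stored as `[out_features, in_features]` and that the layer's output matches `X @ W.T + b`:

```python
import torch
import torch.nn as nn

X = torch.randn(32, 5)        # batch of 32 samples, 5 features each
linear1 = nn.Linear(5, 64)    # in_features=5, out_features=64

print(linear1.weight.shape)   # torch.Size([64, 5])

# the forward pass is equivalent to X @ W.T + b
Z = linear1(X)
Z_manual = X @ linear1.weight.T + linear1.bias
print(torch.allclose(Z, Z_manual, atol=1e-5))  # True
```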

My question is: why do they initialize the weight with the shape `(output_nodes, input_nodes)` rather than `(input_nodes, output_nodes)`, which would remove the need for transposing? If I decide to build a neural network using just NumPy, would initializing the weights with shape `(input_nodes, output_nodes)` and computing `Z = matmul(X, W)` be incorrect?
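For reference, here is a NumPy sketch of the `(input_nodes, output_nodes)` convention I have in mind, which is mathematically just as valid; only the storage layout differs from PyTorch's:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 5))   # batch of 32 samples, 5 features
W = rng.standard_normal((5, 64))   # (input_nodes, output_nodes)
b = np.zeros(64)

Z = X @ W + b                      # no transpose needed
print(Z.shape)                     # (32, 64)
```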

Hi Daniel!

I speculate as follows:

First, we wish to multiply the input (your `X`) from the right by the weight (your `W`)
because `X` may have a leading batch dimension (or multiple leading “batch”
dimensions).

Second, when performing matrix multiplication from the right (not using the
transpose), you stride through `W` with a stride of `out_features` (`64`, in your
example). So you are moving through `W` non-locally. As a general rule of thumb,
this will be less efficient than striding locally through `W` with a stride of one. To
gain this efficiency, PyTorch stores the `weight` matrix of a `Linear` in “transposed”
form (shape `[64, 5]`, in your example).

How much (or whether) this helps depends on the details of your CPU or GPU
pipeline and the specific matrix-multiplication kernel used for the operation.
But, as a general rule, this optimization should help, sometimes significantly.

(Just to be clear, the transpose, `W.T`, is never explicitly computed nor stored.
`W.T` would be a view into `W` that then lets the matrix multiplication use a stride
of one.)
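A quick way to see this (a minimal sketch in PyTorch) is to compare the strides of `W` and `W.T`, and to confirm that the transpose shares the same underlying storage:

```python
import torch

W = torch.nn.Linear(5, 64).weight   # shape [64, 5], contiguous row-major storage
print(W.stride())                   # (5, 1) -- rows are contiguous

# W.T is a view: no data is copied, only the strides are swapped
print(W.T.stride())                 # (1, 5)
print(W.T.data_ptr() == W.data_ptr())  # True -- same memory
```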

Best.

K. Frank