I have an input tensor of shape (32, 5). To create an nn.Linear layer from the input to the hidden layer (let's say we want our hidden layer to have 64 neurons), we would use linear1 = nn.Linear(5, 64). However, when I print linear1.weight.shape, it prints torch.Size([64, 5]). Does this mean that during forward propagation it computes Z = matmul(X, W.T)?

My question is, why do they initialize the weight with the shape (output_nodes, input_nodes), rather than (input_nodes, output_nodes), which would remove the need for transposing? If I decide to make a neural network using just NumPy, would initializing the weights with shape (input_nodes, output_nodes) and computing Z = matmul(X, W) be incorrect?
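In other words, the pure-NumPy version I have in mind would look like this (a minimal sketch; the variable names are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

input_nodes, output_nodes = 5, 64
X = rng.standard_normal((32, input_nodes))

# Weights stored as (input_nodes, output_nodes) -- no transpose needed.
W = rng.standard_normal((input_nodes, output_nodes))
b = np.zeros(output_nodes)

Z = np.matmul(X, W) + b   # Z = matmul(X, W)
assert Z.shape == (32, output_nodes)
```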

First, we wish to multiply the input (your X) from the right by the weight (your W)
because X may have a leading batch dimension (or multiple leading “batch”
dimensions).
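For example (a quick NumPy sketch; np.matmul broadcasts over leading dimensions the same way torch.matmul does):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 5))          # PyTorch layout: (out_features, in_features)

# A single batch dimension ...
X = rng.standard_normal((32, 5))
assert (X @ W.T).shape == (32, 64)

# ... or multiple leading "batch" dimensions: matmul broadcasts over them,
# always contracting over the trailing in_features axis.
X = rng.standard_normal((10, 32, 5))
assert (X @ W.T).shape == (10, 32, 64)
```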

Second, when performing the matrix multiplication from the right without the transpose (that is, Z = matmul(X, W) with W of shape [in_features, out_features]), computing each output element strides through W with a stride of out_features (64, in your example). So you are moving through W non-locally. As a general rule of thumb, this will be less efficient than striding locally through W with a stride of one. To gain this efficiency, PyTorch stores the weight matrix of a Linear in "transposed" form (shape [64, 5], in your example).
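You can see the memory layout directly from the strides (a NumPy sketch; NumPy reports strides in bytes, whereas PyTorch tensors report them in elements, but the layout argument is the same):

```python
import numpy as np

in_features, out_features = 5, 64
itemsize = np.dtype(np.float64).itemsize  # 8 bytes per element

# Stored as (in_features, out_features): producing one output element needs a
# *column* of W, whose consecutive entries are out_features elements apart.
W_io = np.zeros((in_features, out_features))
assert W_io.strides == (out_features * itemsize, itemsize)

# Stored "transposed" as (out_features, in_features), PyTorch style: the row
# W[j, :] needed for output j is contiguous (stride of one element).
W_oi = np.zeros((out_features, in_features))
assert W_oi.strides == (in_features * itemsize, itemsize)
```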

How much (or whether) this helps depends on the details of your CPU or GPU
pipeline and the specific matrix-multiplication kernel used for the operation.
But, as a general rule, this optimization should help, sometimes significantly.

(Just to be clear, the transpose, W.T, is never explicitly computed or stored. W.T is a view into W that then lets the matrix multiplication use a stride of one.)