 Trying to understand the input shape convention and notation

I am confused by the input shape convention that PyTorch uses in some cases:

1. The nn.Linear’s input is of shape (N, ∗, H_in), where N is the batch size, H_in is the number of input features and ∗ means “any number of additional dimensions”. What exactly are these additional dimensions, and how is nn.Linear applied to them?

2. The nn.Conv1d’s input is of shape (N, C_in, L) where N is the batch size as before, C_in the number of input channels, L is the length of signal sequence.

3. The nn.Conv2d’s input is of shape (N, C_in, H, W) where N is the batch size as before, C_in the number of input channels, H is the height and W the width of the image.

4. The nn.BatchNorm1d’s input is of shape (N, C) or (N, C, L), where N is the batch size as before. However, what do C and L denote here? It seems that C = number of features and L = number of channels, based on the description in the documentation: “2D or 3D input (a mini-batch of 1D inputs with optional additional channel dimension)”. This is inconsistent with the nn.Conv1d notation.

5. The nn.BatchNorm2d’s input is of shape (N, C, H, W), where N is the batch size as before and H and W are the height and width of the image respectively. What does C denote here? Is it the number of features as in nn.BatchNorm1d, or the number of channels as in nn.Conv2d? It seems to be the number of channels, since we are talking about “a 4D input (a mini-batch of 2D inputs with additional channel dimension)”. But the documentation also has the line “num_features – C from an expected input of size (N, C, H, W)”, so C is both the number of channels and the number of features, which is weird. So perhaps num_features should be renamed to num_channels.

1. Any number of additional dimensions is supported. The linear layer is applied to the last dimension only, as if you looped over the other dimensions, as seen here:

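import torch
import torch.nn as nn

lin = nn.Linear(10, 10)
x = torch.randn(2, 10, 10)
out = lin(x)  # applied to the last dimension of the whole 3D input
# same computation done by looping over dim 1 and stacking the per-slice outputs
out2 = torch.stack([lin(x_.squeeze(1)) for x_ in x.split(1, dim=1)], dim=1)
print((out - out2).abs().max())  # zero up to floating-point error

2. and 3. Yes, that’s correct.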
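
As a quick check of the conv shapes from points 2 and 3, here is a minimal sketch. The sizes, kernel_size=3 and padding=1 are arbitrary values picked for illustration, not anything from the question:

```python
import torch
import torch.nn as nn

# Illustrative sizes only (assumptions, not from the thread)
N, C_in, C_out, L, H, W = 4, 3, 8, 50, 32, 32

conv1d = nn.Conv1d(C_in, C_out, kernel_size=3, padding=1)
conv2d = nn.Conv2d(C_in, C_out, kernel_size=3, padding=1)

out1d = conv1d(torch.randn(N, C_in, L))     # input is (N, C_in, L)
out2d = conv2d(torch.randn(N, C_in, H, W))  # input is (N, C_in, H, W)

# Dim 0 stays the batch size and dim 1 becomes C_out; with this
# kernel/padding choice the length and spatial dims are preserved.
print(out1d.shape)
print(out2d.shape)
```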

4. nn.BatchNorm1d accepts inputs in [N, C, L], where C is optional. C is the channel dimension, and num_features in the batchnorm layers will use either C or L depending on the input shape, as described in the docs:

• num_features – C from an expected input of size (N,C,L) or L from input of size (N,L)

The Shape section is inconsistent with this, as it lists the inputs as (N, C) or (N, C, L). A PR is welcome in case you want to fix it.
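
To make the two cases concrete, here is a small sketch with arbitrary illustrative sizes. In both cases num_features has to match dim 1 of the input, and the statistics are computed over all remaining dimensions:

```python
import torch
import torch.nn as nn

N, C, L = 8, 5, 20  # illustrative sizes (assumptions)

# 3D input (N, C, L): num_features must equal C
bn_3d = nn.BatchNorm1d(num_features=C)  # modules start in training mode
out3d = bn_3d(torch.randn(N, C, L))
# one statistic per channel, taken over the batch and length dims
mean3d = out3d.mean(dim=(0, 2))

# 2D input: num_features must equal the size of dim 1
bn_2d = nn.BatchNorm1d(num_features=L)
out2d = bn_2d(torch.randn(N, L))
# one statistic per feature, taken over the batch dim only
mean2d = out2d.mean(dim=0)

# The running statistics have one entry per "feature" in each case
print(bn_3d.running_mean.shape, bn_2d.running_mean.shape)
```

With the default affine parameters (weight 1, bias 0), mean3d and mean2d should both be close to zero, which shows which dimensions each layer normalizes over.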

5. C denotes the channel dimension and thus the features in the batchnorm layer.

Why is this weird? This dimension is used as the channel dimension in the input and is named features in the normalization layer.

Renaming num_features to num_channels might be a possibility for batchnorm, but it wouldn’t match other norm layers, which do not necessarily use the channel dimension only.
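
As an illustration, and assuming nn.LayerNorm is a fair example of such a layer, the sketch below compares the learnable parameter shapes. The sizes are arbitrary:

```python
import torch
import torch.nn as nn

N, C, H, W = 4, 3, 8, 8  # illustrative sizes (assumptions)
x = torch.randn(N, C, H, W)

# BatchNorm2d: one scale/shift per channel
bn = nn.BatchNorm2d(num_features=C)
# LayerNorm: normalizes over the trailing dims given by normalized_shape,
# here channel, height and width together
ln = nn.LayerNorm(normalized_shape=[C, H, W])

out_bn = bn(x)
out_ln = ln(x)

print(bn.weight.shape)  # shape (C,)
print(ln.weight.shape)  # shape (C, H, W)
```

Here the "features" of the layer norm span more than the channel dimension, so a parameter called num_channels would not describe them.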

Thank you @ptrblck for your detailed answer. It has clarified a lot for me.

It’s clear that for nn.Conv1d, nn.Conv2d, nn.BatchNorm2d and nn.BatchNorm1d (with a 3D input of shape (B, C, L)):

• The letter C denotes the number of channels which is the same as the number of features.
• The letter L denotes the length of signal sequence (as described in the nn.Conv1d documentation).
• The letters H and W denote the height and width of the image respectively. This is consistent with the 1D case because we could flatten an HxW image to a sequence of length L = H*W.
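
The flattening argument in the last bullet can be checked with a small sketch. The sizes are arbitrary, and both layers are assumed to be freshly initialized and in training mode:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, C, H, W = 4, 3, 5, 7  # illustrative sizes (assumptions)
x = torch.randn(B, C, H, W)

bn2d = nn.BatchNorm2d(C)
bn1d = nn.BatchNorm1d(C)

out2d = bn2d(x)                       # (B, C, H, W)
out1d = bn1d(x.flatten(start_dim=2))  # (B, C, L) with L = H*W

# Both layers normalize each channel over all non-channel dims,
# so the results should agree after flattening.
same = torch.allclose(out2d.flatten(start_dim=2), out1d, atol=1e-5)
print(same)
```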

The confusion arises in the case of a 2D input for nn.BatchNorm1d. I agree that the shape of a 2D input should be changed to (B, L) in the Shape section of the documentation. However, in this case the meaning of L changes automatically from “length of a signal sequence” to “number of features” (and the notion of channels doesn’t exist anymore). In my opinion, a different letter should be used for this case, e.g. (B, H_in) as in nn.Linear, to avoid confusion.

In conclusion, the 2D input and the 3D input to nn.BatchNorm1d are two different things. The first is a mini-batch of 1D feature vectors; the second is a mini-batch of signal sequences, each of length L and with C channels/features. We cannot say that the first is equivalent to the second with shape (B, C=1, L). So in my opinion the shape of the first should be denoted by (B, H_in) and that of the second by (B, C, L).
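
A small sketch of this non-equivalence, with arbitrary sizes and an input whose features deliberately have different scales and offsets. The (B, H_in, 1) comparison is an extra check I added for contrast:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, H_in = 16, 6  # illustrative sizes (assumptions)
# give every feature its own scale and offset
x = torch.randn(B, H_in) * torch.arange(1, H_in + 1) + torch.arange(H_in)

out_2d = nn.BatchNorm1d(H_in)(x)               # (B, H_in): per-feature stats
out_c1 = nn.BatchNorm1d(1)(x.unsqueeze(1))     # (B, 1, H_in): one shared stat
out_l1 = nn.BatchNorm1d(H_in)(x.unsqueeze(2))  # (B, H_in, 1): per-feature stats

# The 2D input behaves like H_in channels of length 1 ...
matches_l1 = torch.allclose(out_2d, out_l1.squeeze(2), atol=1e-5)
# ... and not like a single channel of length H_in.
matches_c1 = torch.allclose(out_2d, out_c1.squeeze(1), atol=1e-5)
print(matches_l1, matches_c1)
```

So the 2D case puts the features in the position the docs call C, which is another reason a separate letter such as H_in would be less confusing than L.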