What's the proper way to decipher dimensions?

I am trying to get this sorted out once and for all. How does PyTorch read dimensions? If I get the size of an MNIST image and see torch.Size([28, 28]), I read width, height. What does PyTorch read if I feed that into a network (since I have to unsqueeze() it for it to work)? Does it read it as batch_size: 28 of 1-d tensors with 28 values each?

Can someone help me with this or point me to some solid documentation/articles about it?

# hypothetical outputs of tensor.size()
torch.Size([28])               # [?]

torch.Size([1, 16])            # [?, ?]

torch.Size([12, 1, 6])         # [?, ?, ?]

torch.Size([32, 1, 12, 12])    # [batch_size, channels, height, width]

torch.Size([32, 3, 2, 16, 28]) # [batch_size, channels, depth, height, width]

Is the first dimension always batch_size? Are the last 2 always height x width?

I just can't find a clean, systematic way of thinking about this anywhere, or an explanation of the design choices behind it. Your help is greatly appreciated.

Lastly, when I am coming out of a conv layer and want to pass the activation into a Linear layer, which view() parameter should be -1, and which dimensions (using the terminology above) should be multiplied together?

example

# x.size() = torch.Size([32, 21, 12, 12])
# should I flatten it like this?
x = x.view(-1, 21 * 12 * 12)

# or like this:
x = x.view(-1, 32 * 12 * 12)

# or like this:
x = x.view(32, -1)

# or other?

@ptrblck, you’re always great at bringing clarity to these sorts of things…

Each layer specifies its expected input and output dimensions in the docs.
E.g. in the docs of nn.Conv2d you see the input defined as [N, C_in, H_in, W_in].

Generally, the layers typically found in a CNN (pooling, normalization, etc.) will accept this shape.

Your example of the 5-dimensional tensor would be a possible input to nn.Conv3d.
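
For illustration, a minimal sketch of both cases (the channel counts and kernel sizes are made-up placeholders):

import torch
import torch.nn as nn

# 4-dimensional input: [batch_size, channels, height, width]
conv2d = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=3)
x = torch.randn(32, 1, 12, 12)      # [N, C_in, H_in, W_in]
print(conv2d(x).shape)              # torch.Size([32, 6, 10, 10])

# 5-dimensional input: [batch_size, channels, depth, height, width]
conv3d = nn.Conv3d(in_channels=3, out_channels=6, kernel_size=2)
y = torch.randn(32, 3, 2, 16, 28)   # [N, C_in, D_in, H_in, W_in]
print(conv3d(y).shape)              # torch.Size([32, 6, 1, 15, 27])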

However, you have to be careful when it comes to e.g. RNNs, as the default input shape is expected to be [seq_len, batch_size, features]. You can pass batch_first=True when creating the RNN to use inputs of shape [batch_size, seq_len, features].
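
A small sketch of the two layouts (the sizes are arbitrary):

import torch
import torch.nn as nn

seq_len, batch_size, features, hidden = 10, 32, 20, 50

# default layout: [seq_len, batch_size, features]
rnn = nn.RNN(input_size=features, hidden_size=hidden)
out, h = rnn(torch.randn(seq_len, batch_size, features))
print(out.shape)    # torch.Size([10, 32, 50]) -> [seq_len, batch_size, hidden]

# batch_first=True: [batch_size, seq_len, features]
rnn_bf = nn.RNN(input_size=features, hidden_size=hidden, batch_first=True)
out, h = rnn_bf(torch.randn(batch_size, seq_len, features))
print(out.shape)    # torch.Size([32, 10, 50]) -> [batch_size, seq_len, hidden]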

The design choices were most likely made for performance reasons.
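
Regarding the view() question: the batch dimension should stay intact, so you would flatten the remaining feature dimensions (channels * height * width). A minimal sketch using the shape from your example:

import torch

x = torch.randn(32, 21, 12, 12)   # [batch_size, channels, height, width]

# keep dim0 (the batch size) and flatten the rest
x = x.view(x.size(0), -1)         # equivalent to x.view(-1, 21 * 12 * 12) here
print(x.shape)                    # torch.Size([32, 3024])

Using x.size(0) instead of hard-coding 32 keeps the code working if the last batch of an epoch happens to be smaller.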

Right, I was reading that earlier. Am I making this more complicated than it is? What about the 3d, 2d and 1d tensors?

For instance, if I get the shape of one of my images (let's just use MNIST), it's [1, 28, 28]. As a human, I read that as [channels, height, width]. As an input to nn.Conv2d, PyTorch just sees it as missing a dimension, while nn.Conv1d would read it as [batch, channels, length]?

Yes, that is correct.
The input shape would throw an error for nn.Conv2d, while nn.Conv1d would treat dim2 as the sequence length.
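
A minimal sketch of both cases (out_channels and kernel_size are arbitrary placeholders):

import torch
import torch.nn as nn

img = torch.randn(1, 28, 28)   # a single MNIST image: [channels, height, width]

# nn.Conv2d expects [batch_size, channels, height, width] -> add the batch dim
conv2d = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=3)
print(conv2d(img.unsqueeze(0)).shape)   # torch.Size([1, 6, 26, 26])

# nn.Conv1d interprets the same tensor as [batch_size, channels, length]
conv1d = nn.Conv1d(in_channels=28, out_channels=6, kernel_size=3)
print(conv1d(img).shape)                # torch.Size([1, 6, 26])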

And in this example:

>>> m = nn.Linear(20, 30)
>>> input = torch.randn(128, 20)
>>> output = m(input)
>>> print(output.size())
torch.Size([128, 30])

is the 128 considered the batch_size for both the input and the output?

Yes, dim0 would correspond to the batch size.