Batch size behaves differently between CPU and multiple GPUs

I am facing an issue where my batch of 16 is automatically split into 4 batches of 4 when I run my code on 4 GPUs, and the output is not returned to me with the full batch size of 16.

Here is the relevant part of my training loop, where I first print the input batch shape, then pass the batch to my model, and finally print the output shape. The encoder model also prints the input shape as soon as it receives it (this is where the mismatch shows up when training on multiple GPUs):

print("Encoder input shape: ", input_tensor.shape)
encoder_output, (encoder_hidden, encoder_cell) = encoder(input_tensor)
decoder_input = torch.squeeze(encoder_hidden, 0)
print("Decoder input shape: ", decoder_input.shape)
class Encoder(nn.Module):
    .
    .
    .
    def forward(self, context_panels):
        print("Input Context Panels shape: ", context_panels.shape)
        encoded_panels = self.panel_encoder(context_panels)
        print("Panel encoder output shape: ", encoded_panels.shape)
        output, (hidden, cell) = self.sequence_encoder(encoded_panels)
        return output, (hidden, cell)

This is the console output I get on CPU training, which is the expected behavior:

Encoder input shape:  torch.Size([16, 3, 160, 160])
Input Context Panels shape:  torch.Size([16, 3, 160, 160])
Panel encoder output shape:  torch.Size([16, 3, 128])
Decoder input shape:  torch.Size([16, 128])

However, this is what I get when I run on 4 GPUs:

Encoder input shape:  torch.Size([16, 3, 160, 160])
Input Context Panels shape:  torch.Size([4, 3, 160, 160])
Input Context Panels shape:  torch.Size([4, 3, 160, 160])
Input Context Panels shape:  torch.Size([4, 3, 160, 160])
Input Context Panels shape:  torch.Size([4, 3, 160, 160])
Panel encoder output shape:  torch.Size([4, 3, 128])
Panel encoder output shape:  torch.Size([4, 3, 128])
Panel encoder output shape:  torch.Size([4, 3, 128])
Panel encoder output shape:  torch.Size([4, 3, 128])
Decoder input shape:  torch.Size([4, 4, 128])

Any feedback on what I am doing wrong is greatly appreciated!

EDIT:

So the split batches are being coalesced correctly for the encoder output from the LSTM, but the encoder hidden state is not being reshaped back in the same way, as I would expect it to be.

CPU output:

Encoder input shape:  torch.Size([16, 3, 160, 160])
Input Context Panels shape:  torch.Size([16, 3, 160, 160])
Panel encoder output shape:  torch.Size([16, 3, 128])
Encoder output shape:  torch.Size([16, 3, 128])
Encoder hidden shape:  torch.Size([1, 16, 128])
Decoder input shape:  torch.Size([16, 128])

4 GPU output:

Encoder input shape:  torch.Size([16, 3, 160, 160])
Input Context Panels shape:  torch.Size([4, 3, 160, 160])
Input Context Panels shape:  torch.Size([4, 3, 160, 160])
Input Context Panels shape:  torch.Size([4, 3, 160, 160])
Input Context Panels shape:  torch.Size([4, 3, 160, 160])
Panel encoder output shape:  torch.Size([4, 3, 128])
Panel encoder output shape:  torch.Size([4, 3, 128])
Panel encoder output shape:  torch.Size([4, 3, 128])
Panel encoder output shape:  torch.Size([4, 3, 128])
Encoder output shape:  torch.Size([16, 3, 128])
Encoder hidden shape:  torch.Size([4, 4, 128])
Decoder input shape:  torch.Size([4, 4, 128])

Right, if you are using DataParallel or DistributedDataParallel, the input is automatically split along the batch dimension across the devices (in this case the 4 GPUs), which is why you see 4 forward calls, each with 1/4 of the original batch size. This is expected behavior, and the output you get back should be gathered (coalesced) to the full batch size.
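
For reference, here is a minimal sketch of that scatter/gather behavior; the toy module and tensor sizes are made up for illustration:

import torch
import torch.nn as nn

# Toy module that reports the per-replica batch size it receives.
class Toy(nn.Module):
    def forward(self, x):
        print("replica sees batch size:", x.shape[0])  # prints 4 on each of the 4 GPUs
        return x * 2  # batch stays in dim 0

if torch.cuda.device_count() > 1:
    model = nn.DataParallel(Toy()).cuda()  # scatters and gathers along dim 0 by default
    out = model(torch.randn(16, 3, 160, 160).cuda())
    print(out.shape)  # torch.Size([16, 3, 160, 160]), gathered back to the full batch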

Is this causing an error for you somewhere else?

Thanks for the clarification! I am indeed using DataParallel()

Please see the edit I made. The split batch dimensions are getting coalesced for the output of my LSTM encoder, but the hidden state dimensions are not being coalesced in the same way. The hidden state is not being returned as 1x16x128 (num_layers x batch_size x hidden_size) but instead as 4x4x128 during multi-GPU training.

The coalescing might be getting confused by the ordering of the dimensions. Can you temporarily permute() the batch dimension to be the first dimension before returning, and then permute() back to the desired data layout afterwards?
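
Something like this, as a rough sketch (single-layer LSTM assumed, matching the shapes you posted; not tested):

# Inside the encoder's forward: move the batch to dim 0 before returning, so
# DataParallel gathers the hidden state along the batch dimension.
def forward(self, context_panels):
    encoded_panels = self.panel_encoder(context_panels)
    output, (hidden, cell) = self.sequence_encoder(encoded_panels)
    # hidden/cell are (num_layers, batch, hidden_size); make them batch-first
    return output, (hidden.permute(1, 0, 2).contiguous(),
                    cell.permute(1, 0, 2).contiguous())

# In the training loop, permute back after the gather:
encoder_output, (encoder_hidden, encoder_cell) = encoder(input_tensor)
encoder_hidden = encoder_hidden.permute(1, 0, 2)  # back to (num_layers, 16, hidden_size)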

Sure, I could always do that! But I feel the default behaviour for data parallelisation should be to re-organise the batches for any tensor that gets returned from the model. I wonder if I should create a GitHub issue to get the devs' opinions.

The issue is that tensors don't currently have "named" dimensions (they are a prototype feature here), so there is no way for PyTorch to actually know which dimension is the batch dimension (e.g., imagine your features had dimensions of (4, 4, 4, 4) or (16, 16, 16, 16)). The current default behavior therefore assumes the first dimension is the batch.
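
Concretely, the default gather concatenates the per-replica results along dim 0, which is exactly why your hidden state comes back as 4x4x128 instead of 1x16x128. A quick sketch of the effect with random tensors:

import torch

# Each of the 4 replicas returns a hidden state of shape (num_layers=1, 4, 128).
chunks = [torch.randn(1, 4, 128) for _ in range(4)]

print(torch.cat(chunks, dim=0).shape)  # torch.Size([4, 4, 128]) -- what the default gather produces
print(torch.cat(chunks, dim=1).shape)  # torch.Size([1, 16, 128]) -- what you actually wanted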