DDP batch size division

Hi!

I am implementing a model using DDP (1 node, 2 GPUs) and am confused about the batch size. I am using a DistributedSampler with a DataLoader to load my data. When initialising the DataLoader I specify batch_size=16. In the training loop each process then receives a batch of 16, making a total effective batch size of 32. A minimal sketch of my setup is below.
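
For context, this is roughly what my loader setup looks like (a minimal sketch; the dataset and names here are placeholders, not my real code):

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    # launched with e.g. `torchrun --nproc_per_node=2 train.py`
    dist.init_process_group(backend="nccl")

    # placeholder dataset standing in for my real one
    dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))

    # DistributedSampler shards the dataset across the 2 ranks
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=16, sampler=sampler)

    for epoch in range(3):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for inputs, targets in loader:
            # each rank gets a batch of 16 here, so 32 samples per step overall
            pass

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```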

Does this behaviour sound correct? From the text below, it seems to me that the batch might instead be split across the GPUs automatically by the DDP module.

"This container parallelizes the application of the given module by splitting the input across the specified devices by chunking in the batch dimension. The module is replicated on each machine and each device, and each such replica handles a portion of the input. During the backwards pass, gradients from each node are averaged.

The batch size should be larger than the number of GPUs used locally."

https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html

Any help much appreciated!