Should we split batch_size according to ngpu_per_node when using DistributedDataParallel?

You are correct. Each DataLoader instance pairs with one DDP instance. If you do not divide batch_size=256 by 4 (your ngpu_per_node), then each DDP instance will process 256 images per iteration. Since your environment has 8 GPUs in total, there will be 8 DDP instances, so one iteration will process 256 * 8 = 2048 images in total.
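Here is a minimal sketch of the per-process setup, assuming (as in your question) a per-node batch_size of 256 and ngpus_per_node = 4; the dataset and model are placeholders, not from your code. It assumes one process per GPU launched with torchrun, which sets WORLD_SIZE and LOCAL_RANK:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    batch_size = 256       # per-node batch size, as in the question
    ngpus_per_node = 4     # hypothetical node size
    per_proc_batch = batch_size // ngpus_per_node  # 64 images per DDP process

    # Dummy dataset standing in for ImageNet-style data.
    dataset = TensorDataset(torch.randn(1024, 3, 224, 224),
                            torch.randint(0, 1000, (1024,)))
    # DistributedSampler gives each process a disjoint shard, so the
    # effective global batch is per_proc_batch * world_size.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=per_proc_batch, sampler=sampler)

    model = DDP(
        torch.nn.Sequential(torch.nn.Flatten(),
                            torch.nn.Linear(3 * 224 * 224, 1000)).cuda(local_rank),
        device_ids=[local_rank])
    # ... training loop as usual ...

if __name__ == "__main__":
    main()
```

Without the division, each of the 8 processes would load a full 256-image batch and the effective global batch would be 2048.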

However, DDP divides the gradients by the world_size by default when it averages them across processes. So, when configuring the learning rate, you only need to consider the batch size of a single DDP instance.
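You can check this averaging behavior yourself with a small, self-contained script (CPU, gloo backend, two processes); the setup here is illustrative, not from your code:

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500",
                            rank=rank, world_size=world_size)
    model = DDP(torch.nn.Linear(1, 1, bias=False))
    # Give each rank a different input so the local gradients differ.
    x = torch.full((1, 1), float(rank + 1))
    model(x).sum().backward()
    # For y = w * x, dy/dw = x, so the local grads would be 1.0 and 2.0;
    # DDP leaves their average, 1.5, on every rank.
    print(f"rank {rank}: grad = {model.module.weight.grad.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```

Both ranks print 1.5, confirming the gradients are averaged (divided by world_size) rather than summed, which is why the single-instance batch size is the right reference for the learning rate.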
