Another question: if we do not divide the batch size by 8, will the total number of images processed in one epoch be the same as usual, or eight times as many?
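
To make this concrete, here is a minimal sketch of the per-process data pipeline I have in mind (launched with one process per GPU via torchrun; the FakeData dataset and the numbers are just placeholders, assuming a DistributedSampler is used):

```python
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler
from torchvision.datasets import FakeData
from torchvision import transforms

dist.init_process_group("nccl")                    # one process per GPU, 8 in total
dataset = FakeData(size=50_000, transform=transforms.ToTensor())
sampler = DistributedSampler(dataset)              # each rank gets a ~1/8 shard
loader = DataLoader(dataset, batch_size=512, sampler=sampler)  # NOT divided by 8

# With DistributedSampler each rank iterates over its own shard once per epoch,
# so the epoch still covers the dataset once, but each iteration now touches
# 512 images per GPU, i.e. 512 * 8 globally.
```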
As for the learning rate: if we have 8 GPUs in total, there will be 8 DDP instances. If the batch size in each DDP instance is 64 (already divided manually), then one iteration processes 64×4=256 images per node. Taking all GPUs into account (2 nodes, 4 GPUs per node), one iteration processes 64×8=512 images. Assume that in the one-node-one-GPU scenario we use 1×lr for batch size 64, 4×lr for batch size 256, and 8×lr for batch size 512 (the common strategy of scaling the learning rate linearly with the batch size). Going back to the DDP scenario (2 nodes, 4 GPUs per node), which learning rate should we use: 1×lr, 4×lr, or 8×lr?
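
For reference, this is how I would encode the linear scaling rule against the global batch size; `base_lr`, the reference batch of 64, and the toy model are made up for illustration, and whether the global batch is really the right quantity to scale by is exactly my question:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])            # set by torchrun
torch.cuda.set_device(local_rank)
world_size = dist.get_world_size()                    # 8 for 2 nodes x 4 GPUs

base_lr = 0.1                                         # "1x lr", tuned for batch size 64
per_gpu_batch_size = 64                               # already divided per DDP process
global_batch_size = per_gpu_batch_size * world_size   # 64 * 8 = 512

# Linear scaling against the GLOBAL batch: 512 / 64 = 8, i.e. 8x lr.
# (Scaling against the per-node batch of 256 would give 4x instead.)
lr = base_lr * global_batch_size / 64

model = DDP(torch.nn.Linear(10, 10).cuda(), device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
```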