I'm new to distributed training and found this discussion about the effective batch size with DDP: "Should we split batch_size according to ngpu_per_node when DistributedDataparallel".
I wonder whether we should also change the number of epochs with DDP. Say that for single-GPU training the number of epochs is 10, which means the model goes through the training dataset 10 times. With DDP and 4 GPUs, the data loader should be copied to each of the 4 GPUs. Does that mean that in 1 epoch each GPU goes through the whole dataset, so the training dataset is used 4 times? In that case, if we use 4 GPUs, should we set the number of epochs to (number of epochs) / (number of GPUs)?
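
To make the question concrete, here is a small sketch of what I think would happen if the loader uses a sampler that shards the data the way `torch.utils.data.DistributedSampler` does with `shuffle=False`. This is an assumption on my side: `shard_indices` is a hypothetical helper written in plain Python, not a PyTorch API, and the dataset size of 12 is made up. If that assumption holds, each of the 4 ranks would see only a quarter of the dataset per epoch, so the number of epochs would not need to be divided. Without such a sampler, each rank would presumably iterate over the full dataset.

```python
import math


def shard_indices(dataset_len, world_size, rank):
    """Hypothetical helper: strided split of dataset indices across ranks.

    Assumed to mimic DistributedSampler with shuffle=False: the index list
    is padded by wrapping around so every rank gets the same number of samples.
    """
    per_rank = math.ceil(dataset_len / world_size)
    total = per_rank * world_size
    indices = list(range(dataset_len))
    indices += indices[: total - dataset_len]  # pad so the list divides evenly
    return indices[rank:total:world_size]


# Made-up example: 12 samples, 4 GPUs (one process per GPU).
world_size = 4
shards = [shard_indices(12, world_size, rank) for rank in range(world_size)]
for rank, shard in enumerate(shards):
    print(f"rank {rank} sees {len(shard)} samples per epoch: {shard}")

# Taken together, the ranks cover each sample once per epoch.
seen = sorted(i for shard in shards for i in shard)
print("every sample seen exactly once:", seen == list(range(12)))
```
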
Thanks for any explanation!