Number of epochs in DistributedDataParallel


I recently wrote a program to train an AlexNet model on the ImageNet dataset. I don’t want to share the code for now; I just want to make sure I understand something correctly, and I didn’t find the answer anywhere. I understand that if my original batch size is, for example, 128, then for parallel training on 2 GPUs the effective batch size becomes 256. But I am confused about the number of epochs: if I want to train the model for 90 epochs, do I set the number of epochs to 45 when using 2 GPUs?

No — keep the number of epochs at 90. With DistributedDataParallel, each process trains on a disjoint shard of the dataset (typically via DistributedSampler), so one epoch still corresponds to one full pass over the dataset; only the number of iterations per epoch on each GPU is halved.
Your script should still have a main training loop where the number of epochs is defined.
E.g. take a look at the ImageNet example, which also uses DDP if enabled.
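To make the bookkeeping concrete, here is a minimal arithmetic sketch (plain Python, no torch required). The dataset size below is the ImageNet-1k training-set size; the batch size and GPU count are the hypothetical numbers from the question:

```python
# Hypothetical numbers from the question above.
DATASET_SIZE = 1_281_167   # ImageNet-1k training images
PER_GPU_BATCH = 128        # batch size given to each process's DataLoader
WORLD_SIZE = 2             # number of GPUs / DDP processes
EPOCHS = 90                # unchanged regardless of WORLD_SIZE

# Each optimizer step consumes one batch per GPU, so the effective
# (global) batch size is the per-GPU batch times the world size.
effective_batch = PER_GPU_BATCH * WORLD_SIZE          # 128 * 2 = 256

# A DistributedSampler gives each rank roughly DATASET_SIZE / WORLD_SIZE
# samples per epoch, so the per-GPU iteration count per epoch is halved,
# but one epoch still covers the whole dataset across all ranks.
samples_per_rank = DATASET_SIZE // WORLD_SIZE         # 640_583
steps_per_epoch = -(-samples_per_rank // PER_GPU_BATCH)  # ceil division

print(effective_batch)   # 256
print(steps_per_epoch)   # 5005
print(EPOCHS)            # still 90
```

In other words, doubling the GPU count halves the wall-clock iterations per epoch, not the number of epochs.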