DataParallel vs single GPU training

Same seed (details below), same machine. DP = DataParallel.

Single GPU: batch size 32, learning rate 0.01
4 GPUs DP: batch size 32, learning rate 0.0025

Should these two settings produce the same training result?
I think they should, but my experiments on the CIFAR10 dataset show similar but not identical training losses and accuracies.

By the way, how about DistributedDataParallel (DDP)? 4 GPUs with DDP, batch size 32, learning rate 0.01: will this lead to the same result?
Thank you very much in advance!

import os
import random
import numpy as np
import torch

# Seed every RNG that can affect training.
random.seed(seed)
np.random.seed(seed)
os.environ['PYTHONHASHSEED'] = str(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

# Trade some speed for reproducible cuDNN kernels.
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
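
For reference, this is roughly how I set up the DataParallel run (MyModel and train_loader are placeholders, not my exact code):

import torch
import torch.nn as nn

model = MyModel().cuda()                     # placeholder model
if torch.cuda.device_count() > 1:
    # DataParallel splits each batch of 32 into 4 chunks of 8,
    # runs them on the 4 GPUs, and gathers the outputs back on GPU 0.
    model = nn.DataParallel(model)

optimizer = torch.optim.SGD(model.parameters(), lr=0.0025)   # 0.01 in the single-GPU run
criterion = nn.CrossEntropyLoss()

for images, labels in train_loader:          # placeholder DataLoader with batch_size=32
    images, labels = images.cuda(), labels.cuda()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()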

Training a model in parallel like this is supposed to be mathematically equivalent to the serial implementation, meaning you should obtain the same model at the end of training whether or not you used DataParallel.

So by training the DataParallel model with a learning rate of 0.0025 you would be getting the same final model as if you had trained on a single GPU with that learning rate. Set the same hyperparameters for the parallel implementation as you would want for the serial implementation.
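
For the DDP part of your question: each process loads its own shard of the data, so to keep the global batch at 32 across 4 GPUs each rank would use a per-rank batch size of 8, while keeping the learning rate you would use on a single GPU. A rough sketch, assuming a single node launched with torchrun (MyModel, train_dataset, and num_epochs are placeholders):

import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

dist.init_process_group(backend="nccl")      # torchrun provides RANK/WORLD_SIZE/MASTER_ADDR
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = MyModel().cuda(local_rank)           # placeholder model
model = DDP(model, device_ids=[local_rank])

sampler = DistributedSampler(train_dataset)  # placeholder dataset
# global batch = 4 ranks * 8 per rank = 32, matching the single-GPU run
loader = DataLoader(train_dataset, batch_size=8, sampler=sampler)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

for epoch in range(num_epochs):              # placeholder epoch count
    sampler.set_epoch(epoch)                 # keeps shuffling consistent across ranks
    for images, labels in loader:
        images, labels = images.cuda(local_rank), labels.cuda(local_rank)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()                      # DDP averages gradients across the 4 ranks here
        optimizer.step()

dist.destroy_process_group()

You would launch it with something like torchrun --nproc_per_node=4 train.py.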

Manually configuring the backends as you have done will probably help with consistency between the two approaches, since different low-level code can otherwise get executed without the user being aware of it.
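
If you want to push reproducibility further, recent PyTorch releases (1.8 and later) can also flag non-deterministic ops for you; a small sketch of that, on top of the cuDNN flags you already set:

import os
import torch

# Needed by cuBLAS for deterministic matmuls on CUDA >= 10.2;
# must be set before any CUDA work happens.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

# Raise an error whenever an op falls back to a non-deterministic implementation.
torch.use_deterministic_algorithms(True)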
