Comparing training losses after each epoch on single-GPU and multi-GPU systems

I wanted to know: when training a model on multiple GPUs using nn.DistributedDataParallel, is the training loss affected compared to a single-GPU system, keeping all parameters, including batch size and learning rate, the same?

Training a model with data parallelism, if done correctly, is mathematically equivalent to training serially on one GPU. If the serial model and the replica models (in DDP) all begin with the same initial weights and use the same training parameters, there should be no discernible difference between the final models.
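A minimal sketch of the "same initial weights" part of this: seeding before model construction makes every DDP replica (and the single-GPU baseline) start bit-identical. The model architecture and seed value here are placeholders, not anything from the thread.

```python
import torch
import torch.nn as nn

def build_model(seed: int = 0) -> nn.Module:
    # Seeding immediately before construction means every process
    # (each DDP rank, or the serial baseline) initializes the same weights.
    torch.manual_seed(seed)
    return nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 1))

# Two independently built models start with identical parameters:
a, b = build_model(), build_model()
for p, q in zip(a.parameters(), b.parameters()):
    assert torch.equal(p, q)
```

One caveat on "same training parameters": DDP averages gradients across ranks, so to match a serial run with global batch size B on N GPUs, each rank should see B/N samples per step rather than B.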

In my case, I am not able to replicate the single-GPU results on a 3-GPU system. The error seems to become stagnant after a certain number of epochs, unlike on the single-GPU system. Also, after some epochs, a `There appear to be 6 leaked semaphore objects ...` error appears unexpectedly. Can you help me out here?

72 semaphores I guess.

DDP doesn’t use semaphores, so it’s likely that your model code has some issues.

There are a few things that usually go wrong when doing DDP and need to be accounted for. Random draws during training and batch norm are two common sources of problems.
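To illustrate the batch-norm point: plain BatchNorm computes statistics per GPU over the local batch only, which diverges from the single-GPU statistics. A common fix is converting to SyncBatchNorm, which aggregates statistics across all ranks. The model below is a made-up placeholder; the commented sampler snippet addresses the random-draws side (per-epoch reshuffling must be coordinated across ranks).

```python
import torch
import torch.nn as nn

# Hypothetical model containing a BatchNorm layer:
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())

# Replace every BatchNorm layer with SyncBatchNorm so the running
# statistics are computed over the global batch, matching single-GPU
# behavior (the conversion itself needs no process group; the forward
# pass does, under DDP):
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

# For randomness in data order, DistributedSampler needs set_epoch()
# each epoch so all ranks reshuffle consistently, e.g.:
#
#   sampler = torch.utils.data.DistributedSampler(dataset)
#   for epoch in range(num_epochs):
#       sampler.set_epoch(epoch)
#       for batch in loader:
#           ...
```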

The model works fine without DDP on a single GPU.