Comparing training losses after each epoch on single-GPU and multi-GPU systems

I wanted to know: when training a model on multiple GPUs using nn.DistributedDataParallel, is the training loss affected compared to a single-GPU system, keeping all parameters, including batch size and learning rate, the same?

Training a model with data parallelism, if done correctly, is mathematically equivalent to training serially on one GPU. If the serial model and the replica models (in DDP) all begin with the same initial weights and use the same training parameters, there should be no discernible difference between the final models.
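A minimal sketch of the "same initial weights" part of this: seeding before model construction makes every DDP replica (and the single-GPU baseline) start bit-identical. The model architecture and seed value here are placeholders, not anything from the thread.

```python
import torch
import torch.nn as nn

def build_model(seed: int = 0) -> nn.Module:
    # Seeding immediately before construction means every process
    # (each DDP rank, or the serial baseline) initializes the same weights.
    torch.manual_seed(seed)
    return nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 1))

# Two independently built models start with identical parameters:
a, b = build_model(), build_model()
for p, q in zip(a.parameters(), b.parameters()):
    assert torch.equal(p, q)
```

One caveat on "same training parameters": DDP averages gradients across ranks, so to match a serial run with global batch size B on N GPUs, each rank should see B/N samples per step rather than B.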

In my case, I am not able to replicate the single-GPU results on a 3-GPU system. The error seems to become stagnant after a certain number of epochs, unlike on the single-GPU system. Also, after some epochs, a `There appear to be 6 leaked semaphore objects ...` error appears unexpectedly. Can you help me out here?

72 semaphores I guess.

DDP doesn’t use semaphores, so it’s likely that your model code has some issues.

There are a few things that usually go wrong when doing DDP and need to be accounted for. Random draws during training and batch norm are two common sources of problems.
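To illustrate the batch-norm point: plain BatchNorm computes statistics per GPU over the local batch only, which diverges from the single-GPU statistics. A common fix is converting to SyncBatchNorm, which aggregates statistics across all ranks. The model below is a made-up placeholder; the commented sampler snippet addresses the random-draws side (per-epoch reshuffling must be coordinated across ranks).

```python
import torch
import torch.nn as nn

# Hypothetical model containing a BatchNorm layer:
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())

# Replace every BatchNorm layer with SyncBatchNorm so the running
# statistics are computed over the global batch, matching single-GPU
# behavior (the conversion itself needs no process group; the forward
# pass does, under DDP):
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

# For randomness in data order, DistributedSampler needs set_epoch()
# each epoch so all ranks reshuffle consistently, e.g.:
#
#   sampler = torch.utils.data.DistributedSampler(dataset)
#   for epoch in range(num_epochs):
#       sampler.set_epoch(epoch)
#       for batch in loader:
#           ...
```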

The model works fine without DDP on a single GPU.