Problem with Computing Loss in DDP Setup

I have a problem with computing the loss in the DDP setup. This is how I do it:

loss = ...                              # per-batch loss from the criterion
train_loss_batch.append(loss.item())    # detach to a Python float per batch
if (batch_idx % 60) == 0:
    print(f'Training Loss: {loss.item():.4f}')
train_losses_epoch.append(np.sum(train_loss_batch) / len(train_loader.dataset))  # per epoch

So basically, per batch I compute the loss and append it to a list; per epoch, I sum the losses in that list and divide by the number of samples in the dataset.

But here are two problems I’m facing:

  • Problem 1:
Train Epoch: 0 [61440 / 760000 (08.08 %)]	Training Loss: 3016.5581	Elapsed Time: 162.10 s
Train Epoch: 0 [122880 / 760000 (16.17 %)]	Training Loss: 2030.2073	Elapsed Time: 309.48 s
Train Epoch: 0 [184320 / 760000 (24.25 %)]	Training Loss: 1733.3897	Elapsed Time: 458.85 s
Epoch 00: 499.45 sec ...
Averaged training loss: 2334.3673, validation loss: 430.6891

Basically, the problem is that there is a huge discrepancy between the validation loss with DDP and without DDP. Without DDP, these are the values I got:

Averaged training loss: 3419.7038, validation loss: 1355.6035
  • Problem 2:
Train Epoch: 1 [61440 / 760000 (08.08 %)]	Training Loss: 1553.6359	Elapsed Time: 158.11 s
Train Epoch: 1 [122880 / 760000 (16.17 %)]	Training Loss: 1494.2148	Elapsed Time: 308.32 s
Train Epoch: 1 [184320 / 760000 (24.25 %)]	Training Loss: 1396.2194	Elapsed Time: 454.41 s
Epoch 01: 502.54 sec ...
Averaged training loss: 383.5921, validation loss: 354.7226

Here, there is roughly a factor-of-4 difference between the training losses printed during the batch updates and the averaged value printed after the epoch finishes. This makes sense, since each of the 4 DDP processes only sees 1/4 of the data, but how can I combine the loss terms of all processes to get the correct loss?

I’d really appreciate some help 🙂

  1. The discrepancy looks expected, since parameters are synced and updated correctly under DDP?

  2. Average the training loss across processes at each iteration?
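For (2), a common approach is to all-reduce the detached loss across processes each iteration. Here's a minimal sketch using `torch.distributed.all_reduce`; the helper name `average_across_processes` is my own, and the single-process `gloo` group at the bottom is only there to make the sketch runnable on its own — in real DDP training the process group is already initialized by your launcher/setup code:

```python
import os
import torch
import torch.distributed as dist

def average_across_processes(value: torch.Tensor) -> torch.Tensor:
    """Sum a tensor over all DDP processes, then divide by the world size."""
    rt = value.clone()
    dist.all_reduce(rt, op=dist.ReduceOp.SUM)  # in-place sum across ranks
    rt /= dist.get_world_size()
    return rt

if __name__ == "__main__":
    # Single-process group purely to make the sketch self-contained.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    loss = torch.tensor(2.5)
    avg = average_across_processes(loss)
    print(avg.item())
    dist.destroy_process_group()
```

In the training loop you would then append `average_across_processes(loss.detach()).item()` instead of the raw per-rank loss, so every rank logs the same globally averaged value.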

Thank you for your answers.

1.) But why is there a discrepancy only for the validation loss and not for the training loss?
2.) Sounds good! Any idea how to achieve this?