Problem with Computing Loss in DDP Setup

I have a problem with computing the loss in the DDP setup. This is how I do it:

loss = ...                              # per-batch loss from the criterion
train_loss_batch.append(loss.item())    # detach to a Python float per batch
if (batch_idx % 60) == 0:
    print(f'Training Loss: {loss.item():.4f}')
train_losses_epoch.append(np.sum(train_loss_batch) / len(train_loader.dataset))  # per epoch

So basically, per batch I compute the loss and append it to a list; per epoch, I sum the losses in that list and divide by the number of samples in the dataset.

But here are two problems I’m facing:

  • Problem 1:
Train Epoch: 0 [61440 / 760000 (08.08 %)]	Training Loss: 3016.5581	Elapsed Time: 162.10 s
Train Epoch: 0 [122880 / 760000 (16.17 %)]	Training Loss: 2030.2073	Elapsed Time: 309.48 s
Train Epoch: 0 [184320 / 760000 (24.25 %)]	Training Loss: 1733.3897	Elapsed Time: 458.85 s
Epoch 00: 499.45 sec ...
Averaged training loss: 2334.3673, validation loss: 430.6891

Basically, the problem is that there is a huge discrepancy between the validation loss with DDP and without DDP. Without DDP, these are the values I got:

Averaged training loss: 3419.7038, validation loss: 1355.6035
  • Problem 2:
Train Epoch: 1 [61440 / 760000 (08.08 %)]	Training Loss: 1553.6359	Elapsed Time: 158.11 s
Train Epoch: 1 [122880 / 760000 (16.17 %)]	Training Loss: 1494.2148	Elapsed Time: 308.32 s
Train Epoch: 1 [184320 / 760000 (24.25 %)]	Training Loss: 1396.2194	Elapsed Time: 454.41 s
Epoch 01: 502.54 sec ...
Averaged training loss: 383.5921, validation loss: 354.7226

Here, there is roughly a factor-of-4 difference between the training losses printed during the batch updates and the averaged value printed after the epoch finishes. This makes sense, since each of the 4 DDP processes only sees 1/4 of the data, but how can I combine the loss terms of all processes to get the correct loss?

I’d really appreciate some help 🙂

  1. The discrepancy looks expected, since parameters are synced and updated correctly under DDP?

  2. Average the training loss across processes at each iteration?
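For (2), a common approach is to all-reduce the detached loss across processes each iteration. Here's a minimal sketch using `torch.distributed.all_reduce`; the helper name `average_across_processes` is my own, and the single-process `gloo` group at the bottom is only there to make the sketch runnable on its own — in real DDP training the process group is already initialized by your launcher/setup code:

```python
import os
import torch
import torch.distributed as dist

def average_across_processes(value: torch.Tensor) -> torch.Tensor:
    """Sum a tensor over all DDP processes, then divide by the world size."""
    rt = value.clone()
    dist.all_reduce(rt, op=dist.ReduceOp.SUM)  # in-place sum across ranks
    rt /= dist.get_world_size()
    return rt

if __name__ == "__main__":
    # Single-process group purely to make the sketch self-contained.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    loss = torch.tensor(2.5)
    avg = average_across_processes(loss)
    print(avg.item())
    dist.destroy_process_group()
```

In the training loop you would then append `average_across_processes(loss.detach()).item()` instead of the raw per-rank loss, so every rank logs the same globally averaged value.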

Thank you for your answers.

1.) But why is there a discrepancy only for the validation loss and not for the training loss?
2.) Sounds good! Any idea how to achieve this?