I have a problem with computing the loss in the DDP setup. This is how I do it:
loss = ...                            # per-batch loss
train_loss_batch.append(loss)         # collect per batch
if (batch_idx % 60) == 0:
    print(f'Training Loss: {loss}')
train_losses_epoch.append(np.sum(train_loss_batch) / len(train_loader.dataset))  # per epoch
So basically: per batch, I compute the loss and append it to a list, and at the end of each epoch, I sum the losses in the list and divide by the number of samples.
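To make the bookkeeping concrete, here is a minimal stand-alone version of what I do per epoch (plain floats and dummy numbers in place of real batch losses; `num_samples` plays the role of `len(train_loader.dataset)`):

```python
# Sketch of the per-epoch loss bookkeeping described above.
# Dummy per-batch summed losses stand in for the real ones.
train_loss_batch = []
for batch_loss in [120.0, 90.0, 60.0]:
    train_loss_batch.append(batch_loss)

num_samples = 30  # dummy stand-in for len(train_loader.dataset)
epoch_loss = sum(train_loss_batch) / num_samples
print(epoch_loss)  # 270 / 30 -> 9.0
```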
But here are two problems I’m facing:
- Problem 1:
Train Epoch: 0 [61440 / 760000 (08.08 %)] Training Loss: 3016.5581 Elapsed Time: 162.10 s
Train Epoch: 0 [122880 / 760000 (16.17 %)] Training Loss: 2030.2073 Elapsed Time: 309.48 s
Train Epoch: 0 [184320 / 760000 (24.25 %)] Training Loss: 1733.3897 Elapsed Time: 458.85 s
Epoch 00: 499.45 sec ...
Averaged training loss: 2334.3673, validation loss: 430.6891
Basically, the problem is that there is a huge discrepancy between the validation loss with DDP and without DDP. Without DDP, these are the values I had gotten:
Averaged training loss: 3419.7038, validation loss: 1355.6035
- Problem 2:
Train Epoch: 1 [61440 / 760000 (08.08 %)] Training Loss: 1553.6359 Elapsed Time: 158.11 s
Train Epoch: 1 [122880 / 760000 (16.17 %)] Training Loss: 1494.2148 Elapsed Time: 308.32 s
Train Epoch: 1 [184320 / 760000 (24.25 %)] Training Loss: 1396.2194 Elapsed Time: 454.41 s
Epoch 01: 502.54 sec ...
Averaged training loss: 383.5921, validation loss: 354.7226
Here, there is roughly a factor-4 difference between the training losses printed during the batch updates and the averaged loss printed after the epoch finishes. This makes sense, since each of the 4 DDP processes sees only 1/4 of the data, but how can I sum the loss terms across all processes to get the correct loss?
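Would something like the following be the right approach? This is just a sketch of summing each rank's loss total with `torch.distributed.all_reduce` before dividing by the full dataset size; the single-process gloo init and the dummy numbers (`4` standing in for the real dataset size) are only there so the snippet runs standalone:

```python
import os
import torch
import torch.distributed as dist

def reduce_loss_sum(local_sum: torch.Tensor) -> torch.Tensor:
    """Sum a scalar loss tensor across all DDP processes (in place)."""
    dist.all_reduce(local_sum, op=dist.ReduceOp.SUM)
    return local_sum

# Single-process setup purely so this sketch can run standalone;
# in the real job, torchrun/launch sets rank and world size.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29517")
dist.init_process_group("gloo", rank=0, world_size=1)

# Dummy per-rank batch losses standing in for train_loss_batch.
train_loss_batch = [torch.tensor(10.0), torch.tensor(20.0)]
local_sum = torch.stack(train_loss_batch).sum()
global_sum = reduce_loss_sum(local_sum)   # summed over all ranks
epoch_loss = global_sum.item() / 4        # 4 = dummy full-dataset size
print(epoch_loss)

dist.destroy_process_group()
```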
I’d really appreciate some help.