Way to aggregate loss in DDP training

I was implementing logging snippets in Ultralytics YOLOv8, and I realized that the way it calculates the loss during training is the following:

# Forward
with torch.cuda.amp.autocast(self.amp):
    # a couple of lines of code
    self.loss, self.loss_items = self.criterion(preds, batch)
    if RANK != -1:
        self.loss *= world_size
    self.tloss = (self.tloss * i + self.loss_items) / (i + 1) if self.tloss is not None \
        else self.loss_items

So it basically takes the loss calculated in the local process and multiplies it by the number of processes.
It seems reasonable, and it shouldn’t differ from the ‘real loss’ by a lot, but it is not the actual aggregated loss from the different processes.
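To make the approximation concrete, here is a toy calculation with made-up per-rank loss values (the numbers are illustrative, not from YOLOv8):

```python
# Hypothetical per-rank losses on a 2-GPU run (illustrative values only).
world_size = 2
per_rank_losses = [0.75, 1.25]  # each rank computes its loss independently

# Exact aggregated loss: the sum over all ranks.
exact_sum = sum(per_rank_losses)

# YOLOv8-style estimate: scale the local (rank-0) loss by world_size.
# This is only exact when every rank happens to see the same loss.
approx_sum = world_size * per_rank_losses[0]

print(exact_sum, approx_sum)  # prints: 2.0 1.5
```

The closer the per-rank losses are to each other, the smaller the gap between the two numbers, which is why the estimate is usually fine for logging.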

Does DDP ever aggregate all the losses calculated across the different processes?
Or is the snippet above the only way to get the loss?

No, DDP does not aggregate the losses from different ranks, as each rank gets an independent input and calculates “its own” loss and gradients. The gradients are allreduced during the backward pass, so eventually all .grad attributes contain the same gradients before the corresponding parameters are updated.
More details about DDP can be found in the DDP internal design doc.
I guess your code assumes an approximately equal loss on the different ranks and just scales it, e.g. for logging purposes?

Yeah, the code is from YOLOv8, an open-source object detection model by Ultralytics, and it assumes the same loss on different ranks, like you said. So I was just wondering if there is a way to get the exact loss aggregated across the different ranks, but the approximation seems reasonable and good enough for logging in my case.

Thanks for your reply!

Yes, you could manually allreduce the losses, as seen in e.g. this example, which sums the losses here.
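For reference, a manual allreduce for logging might look like the sketch below. It is run here as a single-process “gloo” group so it is self-contained; in real DDP training the launcher creates the group across all ranks, and the loss value, address, and port are made up for illustration:

```python
import os
import torch
import torch.distributed as dist

# Minimal single-process "gloo" group so the sketch runs without torchrun;
# in real DDP training the launcher sets these up across all ranks.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

# The local loss this rank computed (value is made up for illustration).
local_loss = torch.tensor(1.5)

# Detach and clone so backward() on the real loss is unaffected,
# then sum the copies from every rank in place.
loss_for_log = local_loss.detach().clone()
dist.all_reduce(loss_for_log, op=dist.ReduceOp.SUM)

# Divide by the number of ranks to log the mean loss instead of the sum.
mean_loss = loss_for_log / dist.get_world_size()

dist.destroy_process_group()
```

Because the reduction happens on a detached clone, the extra communication only affects logging, not the gradients DDP already synchronizes during backward.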
