I was implementing logging snippets in Ultralytics YOLOv8, and I realized that the way it calculates the loss during training is the following:
# a couple of lines of code
self.loss, self.loss_items = self.criterion(preds, batch)
if RANK != -1:
    self.loss *= world_size
self.tloss = (self.tloss * i + self.loss_items) / (i + 1) if self.tloss is not None \
So it basically takes the loss calculated in the main process and multiplies it by the number of processes.
It seems reasonable, and it shouldn’t be different from the ‘real loss’ by a lot, but it is not the actual aggregated loss from different processes.
Does DDP ever aggregate all the losses calculated across the different processes?
Or is the above snippet the only way to get the loss?
DDP does not aggregate the losses from different ranks, as each rank gets an independent input and calculates “its own” gradients. The gradients are allreduced during the backward pass, so eventually all .grad attributes contain the same gradients before the corresponding parameters are updated.
More details about DDP can be found in the DDP internal design doc.
I guess your code assumes approximately the same loss on different ranks and just scales it, e.g. for logging purposes?
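To make the difference concrete, here is a toy calculation in plain Python. The per-rank losses are made-up numbers for illustration only (they are not taken from YOLOv8), and the variable names are hypothetical; it just compares "local loss times world size" against an exact sum over ranks.

```python
# Hypothetical per-rank losses for one training step with 4 processes
# (made-up numbers, purely illustrative).
per_rank_losses = [0.9, 1.1, 1.0, 1.2]
world_size = len(per_rank_losses)

# What the scaling approach reports on rank 0: its local loss times world_size.
scaled_local = per_rank_losses[0] * world_size

# What an exact aggregation would report: the sum over all ranks.
exact_sum = sum(per_rank_losses)

print(f"scaled local loss: {scaled_local:.2f}")
print(f"exact summed loss: {exact_sum:.2f}")
```

The two values agree only when every rank sees the same loss, which is why the scaled value is an approximation rather than the aggregated loss.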
Yeah, the code is from YOLOv8, an open-source object detection project by Ultralytics, and it assumes the same loss on different ranks, like you said. So I was just wondering whether there is a way to get the exact loss calculated across the different ranks, but the approximation seems reasonable and good enough for logging in my case.
Thanks for your reply!
Yes, you could manually allreduce the losses, as seen in e.g. this example, which sums the losses here.
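A minimal sketch of what such a manual allreduce could look like for logging is below. The helper name `reduce_loss_for_logging` and its `average` flag are hypothetical (they are not part of YOLOv8 or PyTorch); the sketch assumes the process group has already been initialized by your DDP launcher and that the loss tensor lives on a device supported by the backend (e.g. a CUDA tensor for NCCL).

```python
import torch
import torch.distributed as dist


def reduce_loss_for_logging(loss: torch.Tensor, average: bool = True) -> torch.Tensor:
    """Hypothetical helper: aggregate a per-rank loss across all ranks for logging."""
    # Detach and clone so the collective neither touches the autograd graph
    # nor modifies the tensor that is used for backward().
    reduced = loss.detach().clone()
    # Outside a distributed run, just return the local loss.
    if not (dist.is_available() and dist.is_initialized()):
        return reduced
    # In-place sum over all ranks; every rank has to make this call.
    dist.all_reduce(reduced, op=dist.ReduceOp.SUM)
    if average:
        # Turn the sum into a mean over ranks.
        reduced /= dist.get_world_size()
    return reduced
```

You would call it after computing the loss on each rank, e.g. `logged = reduce_loss_for_logging(loss)`, and only log the result on rank 0. Note that this adds one extra collective per call, so it may be worth doing only every few iterations.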