Hi,
I was implementing logging snippets in Ultralytics YOLOv8, and I realized that the way it calculates the loss during training is the following:
# Forward
with torch.cuda.amp.autocast(self.amp):
    # a couple of lines of code
    self.loss, self.loss_items = self.criterion(preds, batch)
    if RANK != -1:
        self.loss *= world_size
    self.tloss = (self.tloss * i + self.loss_items) / (i + 1) if self.tloss is not None \
        else self.loss_items
So it basically takes the loss calculated in the main process and multiplies it by the number of processes.
It seems reasonable, and it shouldn't differ from the "real loss" by much, but it is not the actual aggregated loss from the different processes.
Does DDP ever aggregate all the losses calculated across the different processes?
Or is the above snippet the only way to get the loss?
No, DDP does not aggregate the losses from different ranks, as each rank gets an independent input and calculates "its own" loss and gradients. The gradients are allreduced during the backward pass, so eventually all .grad attributes contain the same gradients before the corresponding parameters are updated.
More details about DDP can be found in the DDP internal design doc.
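To make the distinction concrete, here is a single-process sketch (no real DDP involved, just a simulation of two ranks) of what the gradient allreduce amounts to: each "rank" computes its own loss and local gradient from its own input, and averaging those local gradients gives the identical .grad every rank holds after backward, while the per-rank losses stay different.

```python
import torch

def rank_step(rank_seed: int, weight: torch.Tensor):
    # One simulated "rank": its own input -> its own loss -> its own
    # local gradient, computed on an identical copy of the parameters.
    torch.manual_seed(rank_seed)
    x = torch.randn(8, 4)
    w = weight.clone().requires_grad_(True)
    loss = (x @ w).pow(2).mean()
    loss.backward()
    return loss.item(), w.grad

torch.manual_seed(42)
weight = torch.randn(4, 1)  # same initial parameters on every "rank"
loss0, grad0 = rank_step(0, weight)
loss1, grad1 = rank_step(1, weight)

# DDP never touches the losses, so they remain different per rank;
# the allreduce averages the local gradients, and that averaged tensor
# is what every rank sees in .grad before the optimizer step.
avg_grad = (grad0 + grad1) / 2
```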
I guess your code assumes approximately the same loss on the different ranks and just scales it, e.g. for logging purposes?
Yeah, the code is from YOLOv8, an open-source object detection project by Ultralytics, and it assumes the same loss on different ranks, like you said. So I was just wondering if there is a way to get the exact loss calculated across the different ranks, but the approximation seems reasonable and good enough for logging in my case.
Thanks for your reply!
Yes, you could manually allreduce the losses, as seen in e.g. this example, which sums the losses here.
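As a minimal sketch of that manual allreduce (assuming the process group is already initialized by the DDP launcher, and using a hypothetical helper name), averaging the per-rank losses for logging could look like this:

```python
import torch
import torch.distributed as dist

def reduced_loss(loss: torch.Tensor, average: bool = True) -> torch.Tensor:
    # Detach and clone so the reduction is purely for logging and
    # never interferes with the backward pass.
    reduced = loss.detach().clone()
    # Sum the per-rank losses across all processes; after this call
    # every rank holds the same reduced value.
    dist.all_reduce(reduced, op=dist.ReduceOp.SUM)
    if average:
        reduced /= dist.get_world_size()
    return reduced
```

With world_size == 1 this is a no-op; under real DDP each rank would then log the same averaged value instead of its local loss.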