Hi All,
I want to skip a training iteration on a single GPU when the training loss is greater than 10,000 (for example). The naive way of doing it is as follows.
for x in train_dataloader:
    optimizer.zero_grad()
    out = model(x)
    out_criterion = criterion(out, x)
    out_criterion.backward()
    if out_criterion > 10000:  # skip the optimizer update on this rank
        continue
    optimizer.step()
However, this results in "stalled training" due to a synchronization issue between the GPUs. Is there a more efficient way of achieving this, especially in the DDP setting?
Maybe you could allreduce the losses, check the reduced value, and skip the step() on all ranks, instead of checking the loss on each rank separately, which would make the runs diverge.
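For reference, a minimal sketch of that approach could look like the code below (reusing model, optimizer, criterion, and train_dataloader from the snippet above; threshold is just a placeholder name). Every rank still runs backward() so DDP's gradient synchronization stays in sync, and the skip decision is based on a reduced loss value that is identical on all ranks, so no rank stalls waiting for the others.

import torch.distributed as dist

threshold = 10000.0  # placeholder value

for x in train_dataloader:
    optimizer.zero_grad()
    out = model(x)
    out_criterion = criterion(out, x)
    out_criterion.backward()  # runs on every rank, so DDP's gradient all-reduce stays in sync

    # make the skip decision on a loss value that is identical on all ranks
    loss_check = out_criterion.detach().clone()  # keep on the GPU when using the NCCL backend
    dist.all_reduce(loss_check, op=dist.ReduceOp.SUM)
    loss_check /= dist.get_world_size()  # average the loss across ranks

    if loss_check.item() > threshold:
        continue  # all ranks skip the update together
    optimizer.step()

Whether to average the losses (as here) or reduce with ReduceOp.MAX, so that a single outlier rank triggers the skip everywhere, is a design choice depending on the behavior you want.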
Thank you for your reply. Do you mean something like this?
dist.barrier()
dist.all_reduce(out_criterion, op=dist.ReduceOp.SUM)
The drawback of this is that if we have N GPUs for training and only a single GPU has, e.g., out_criterion = 10000 while the others have normal losses, then the gradients of all those other GPUs will also be wasted, since we are not taking the optimization step.
Yes, but this would also be expected, since the parameters on each rank are equal in every iteration. If you skip the update on only one rank, your training would diverge since the parameters would differ across ranks.
Got it, thanks for your response.