Hi All,

I want to skip the training iteration on a single GPU when the training loss is greater than 10,000 (for example). The naive way of doing this is as follows.

```
for x in train_dataloader:
    optimizer.zero_grad()
    out = model(x)
    out_criterion = criterion(out, x)
    out_criterion.backward()
    if out_criterion > 10000:  # skip the optimizer update
        continue
    optimizer.step()
```

However, this results in "stalled training" due to a synchronization issue between the GPUs. Is there another efficient way of achieving this, especially in the DDP setting?

Maybe you could `allreduce` the losses, check the value, and skip the `step()` on all ranks instead of checking it on each rank separately, which would make the runs diverge.
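A minimal runnable sketch of that idea, using a single-process `gloo` group so it can run standalone (in real training, `torchrun` sets the rank/world-size environment variables, and the threshold of 10,000 is just the example value from the question):

```python
import os
import torch
import torch.distributed as dist

# Hypothetical single-process setup so the sketch runs standalone;
# in real DDP training these come from the launcher (e.g. torchrun).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29612")
dist.init_process_group("gloo", rank=0, world_size=1)

def should_skip(loss: torch.Tensor, threshold: float = 10000.0) -> bool:
    # Average the loss across all ranks so every rank sees the same
    # value and therefore makes the same skip/step decision.
    synced = loss.detach().clone()
    dist.all_reduce(synced, op=dist.ReduceOp.SUM)
    synced /= dist.get_world_size()
    return synced.item() > threshold

skip_high = should_skip(torch.tensor(20000.0))  # True  (above threshold)
skip_low = should_skip(torch.tensor(5.0))       # False (normal loss)

dist.destroy_process_group()
```

In the training loop you would call `should_skip(out_criterion)` after `backward()` and `continue` on all ranks together when it returns `True`, so the replicas never disagree about whether `step()` ran.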


Thank you for your reply. Do you mean like this?

```
dist.barrier()
dist.all_reduce(out_criterion, op=dist.ReduceOp.SUM)
```

The drawback of this is that if we have N GPUs for training and only a single GPU has e.g. out_criterion = 10000 while the others have normal losses, then the gradients of all those other GPUs will also be wasted, since we are not taking the optimization step.

Yes, but this would also be expected, since all parameters on each rank are equal in each iteration. If you skip the update on one rank, your training would diverge since the parameters would differ.
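The divergence can be illustrated with a toy two-"rank" simulation (no actual process group; two parameter tensors stand in for two DDP replicas that received the same synchronized gradient, but only one of them applies `step()`):

```python
import torch

# Two "ranks" start with identical parameters.
w0 = torch.nn.Parameter(torch.ones(3))
w1 = torch.nn.Parameter(torch.ones(3))
opt0 = torch.optim.SGD([w0], lr=0.1)
opt1 = torch.optim.SGD([w1], lr=0.1)

# After DDP's backward, both ranks hold the same averaged gradient.
grad = torch.tensor([1.0, 2.0, 3.0])
w0.grad = grad.clone()
w1.grad = grad.clone()

opt0.step()  # rank 0 updates its parameters
# rank 1 skips its step -> the replicas now hold different parameters,
# and every subsequent iteration computes gradients from different weights.
params_equal = torch.equal(w0.detach(), w1.detach())  # False
```

This is why the skip decision must be made identically on all ranks.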

Got it, thanks for your response.