Multi-GPU training with a training loop that can skip backpropagation

Hi,
I am training a model where, for certain batches, there is no loss (and backprop is skipped) when certain conditions on the model output are met. I am trying to train this model using DDP; however, if one of the GPUs has no loss for its batch, it does not perform backprop, and the other GPUs wait indefinitely for that GPU's gradients, leading to a timeout. Is there a way to handle such cases? The code works fine on a single GPU. I tried creating a dummy loss and using it to perform backprop, but it did not help. Any help would be appreciated.

            if self.args.amp:
                optimizer.zero_grad()
                # handle cases where there is nothing to optimize (loss has no grad_fn)
                if losses.grad_fn is None:
                    losses = losses + torch.zeros(1, requires_grad=True).to(losses.device)
                scaler.scale(losses).backward()
                # Check if there are valid gradients before stepping the optimizer
                valid_gradients = any(p.grad is not None for p in model.parameters() if p.requires_grad)
                if valid_gradients:
                    if self.args.clip_max_norm > 0:
                        scaler.unscale_(optimizer)
                        torch.nn.utils.clip_grad_norm_(model.parameters(), self.args.clip_max_norm)
                    scaler.step(optimizer)
                    scale = scaler.get_scale()
                    scaler.update()
                    # if the loss scale decreased, GradScaler skipped this step due to inf/NaN grads
                    skip_lr_step = (scale > scaler.get_scale())
                else:
                    print("Skipping backprop step.")
                    skip_lr_step = True
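
For what it's worth, the only other idea I have is to tie the fallback loss to the model parameters (multiplied by zero), so that every rank still produces (all-zero) gradients and participates in DDP's gradient all-reduce. Below is a rough standalone sketch of what I mean; the model and names here are placeholders, not my actual training code:

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 1)  # placeholder model, not my real one

    def loss_with_fallback(losses, model):
        # If nothing contributed to the loss (no grad_fn), add a zero-valued
        # term built from the parameters so backward() still produces
        # (all-zero) gradients on this rank.
        if losses.grad_fn is None:
            dummy = sum(p.sum() for p in model.parameters()) * 0.0
            losses = losses + dummy
        return losses

    # Simulate a batch where the real loss is just a constant with no graph.
    losses = torch.tensor(0.0)
    losses = loss_with_fallback(losses, model)
    losses.backward()
    print(all(p.grad is not None for p in model.parameters()))  # True, grads are zero

I have not verified this in my multi-GPU setup yet, so I am not sure it actually avoids the hang.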