Loss.backward() hangs at the second iteration when using DDP

The network works well when using a single GPU.

  for cur_it in range(total_it_each_epoch):
      try:
          batch = next(dataloader_iter)
      except StopIteration:
          dataloader_iter = iter(train_loader)
          batch = next(dataloader_iter)
          print('new iters')

      lr_scheduler.step(accumulated_iter)

      if rank == 0:
          print('\n---data passed---')

      try:
          cur_lr = float(optimizer.lr)
      except:
          cur_lr = optimizer.param_groups[0]['lr']

      model.train()
      optimizer.zero_grad()

      loss, tb_dict, disp_dict = model_func(model, batch)

      if rank == 0:
          print('\n---forward passed---')

      print(loss)
      
      loss.backward()  # hangs here from the second iteration onward

      optimizer.step()

      if rank == 0:
          print('\n---backward passed---')

      time.sleep(0.01)

The rank-0 prints show that everything works in the first iteration; however, it hangs at loss.backward() in the second iteration (sometimes the third).
I am really confused; any help would be appreciated.

Could you provide a minimal script to reproduce the hang? In particular, could you include the model construction (e.g., how you are wrapping the model with DistributedDataParallel)?
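
Something along these lines would help. Here is a rough sketch of the shape of such a script, where build_model, make_dummy_batches, and model_func are placeholders for your own code:

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # Assumes the script is launched with torchrun, which sets RANK/LOCAL_RANK/WORLD_SIZE.
        dist.init_process_group(backend='nccl')
        local_rank = int(os.environ['LOCAL_RANK'])
        torch.cuda.set_device(local_rank)

        model = build_model().cuda()                 # placeholder: your model construction
        model = DDP(model, device_ids=[local_rank])  # the DDP wrapping we need to see

        optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

        model.train()
        for batch in make_dummy_batches():           # placeholder: a few random batches
            optimizer.zero_grad()
            loss, tb_dict, disp_dict = model_func(model, batch)  # your model_func
            loss.backward()
            optimizer.step()

        dist.destroy_process_group()

    if __name__ == '__main__':
        main()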

Thanks for your reply!
I narrowed down the problem by skipping some code blocks. It appears to be caused by a custom operator (a CUDA extension).
But I cannot find the specific reason.

Skipping layers is not allowed unless you pass find_unused_parameters=True, since the backward pass waits for the gradients of all registered parameters to be ready so that it can communicate them (see the sketch after the quoted docs).
From the docs:

find_unused_parameters (bool) – Traverse the autograd graph from all tensors contained in the return value of the wrapped module’s forward function. Parameters that don’t receive gradients as part of this graph are preemptively marked as being ready to be reduced. In addition, parameters that may have been used in the wrapped module’s forward function but were not part of loss computation and thus would also not receive gradients are preemptively marked as ready to be reduced. (default: False)
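
Concretely, the flag is passed when wrapping the model. A minimal sketch, assuming local_rank comes from your launcher:

    from torch.nn.parallel import DistributedDataParallel as DDP

    # Let DDP detect parameters that receive no gradient in a given iteration,
    # instead of waiting indefinitely for their gradients during backward.
    model = DDP(
        model,
        device_ids=[local_rank],   # assumed: one GPU per process
        find_unused_parameters=True,
    )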

Thanks. But the problem occurs even when all parameters are used to compute the loss.
Skipping some code blocks while passing find_unused_parameters=True is exactly the method I used to locate the bug.
And I found that the bug is related to a custom operator.
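
For anyone hitting something similar: one way to localize such a hang is to force synchronization around the suspected extension call. A rough sketch, where custom_op stands in for the CUDA extension:

    import torch

    torch.cuda.synchronize()   # ensure all previously launched kernels have finished
    out = custom_op(inputs)    # placeholder for the custom CUDA extension call
    torch.cuda.synchronize()   # if this call never returns, the custom kernel itself is the culprit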