Using convert_sync_batchnorm makes my code deadlock

When I use `net = torch.nn.SyncBatchNorm.convert_sync_batchnorm(net)` to replace BatchNorm with SyncBatchNorm, the code deadlocks like this:
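For reference, the conversion itself can be reproduced on a plain module without any distributed setup — this is a minimal sketch (the toy `Sequential` model is my own, not the poster's network) showing that `convert_sync_batchnorm` swaps every `BatchNorm` layer for `SyncBatchNorm`. Note that the resulting model is only meant to be run under DistributedDataParallel with one process per GPU; the synchronization happens later, at forward/backward time, which is where a hang would surface.

```python
import torch
import torch.nn as nn

# Hypothetical toy model standing in for the poster's `net`.
net = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3),
    nn.BatchNorm2d(8),
    nn.ReLU(),
)

# Recursively replace every BatchNorm*d layer with SyncBatchNorm.
net = nn.SyncBatchNorm.convert_sync_batchnorm(net)

# The BatchNorm2d at index 1 is now a SyncBatchNorm.
print(type(net[1]).__name__)  # → SyncBatchNorm
```

The conversion runs fine even without a process group; the deadlock can only come from the collective communication that SyncBatchNorm performs when the converted model is actually executed across ranks.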

It seems to be a problem with the dataloader. The relevant code is as follows:

Could anyone kindly help me? Thanks.

The difference between BatchNorm and SyncBatchNorm is that SyncBatchNorm uses torch.distributed.all_reduce in the backward pass.

Two questions:

  1. What args and env vars did you pass to init_process_group?
  2. In your program, is there any other code that launches communication ops?
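To make the two questions above concrete: SyncBatchNorm's `all_reduce` calls hang if the process group was initialized inconsistently across ranks, or if some rank issues an extra collective the others don't match. Here is a minimal single-process sketch (gloo backend; the localhost address and port are arbitrary choices for illustration) of the `init_process_group` call and an explicit `all_reduce`, the same collective SyncBatchNorm relies on:

```python
import os
import torch
import torch.distributed as dist

# Rendezvous settings every rank must agree on; values here are
# placeholders for a single-process demo on localhost.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(backend="gloo", rank=0, world_size=1)

t = torch.tensor([1.0, 2.0])
# all_reduce sums the tensor across all ranks; with world_size=1
# the "sum" is just the local tensor, so this returns immediately.
# With world_size > 1 it blocks until EVERY rank makes the same call,
# which is exactly how a mismatched collective turns into a deadlock.
dist.all_reduce(t)
print(t.tolist())  # → [1.0, 2.0]

dist.destroy_process_group()
```

If one rank skips a forward/backward pass (for example, because its dataloader yields fewer batches than the others'), the remaining ranks block forever inside such an `all_reduce` — which is why the answers to both questions matter here.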