Program freezes on using torch.distributed.all_gather during forward pass

I have defined some buffer variables in __init__ of the model, implementing a similar idea to MoCo.

Model: 

def __init__(self, config):
    <other modules>
    self.K = 66536
    self.register_buffer("queue", torch.randn(768, config.NETWORK.MEMORY_BANK_SIZE))
    self.queue = nn.functional.normalize(self.queue, dim=0)
    self.register_buffer("queue_ptr", torch.zeros(1, dtype=torch.long))

During the forward pass, I gather a tensor "a" using torch.distributed.all_gather. After the call, I can still print the shape and type of the buffer tensors, but I cannot print the values of, or otherwise use, the buffer tensors or the gathered tensors. The training script freezes at that point.

def forward(self, batch_input):
    with torch.no_grad():
        a = some_module(batch_input) # compute tensor a by passing batch_input through some module
    self.concat_all_gather(a)

@torch.no_grad()
def concat_all_gather(self,tensor):            

    print("PRE_{}_{}".format(torch.distributed.get_rank(),self.queue_ptr))

    tensors_gather = [torch.ones_like(tensor) for _ in range(torch.distributed.get_world_size())]:
    torch.distributed.all_gather(tensors_gather, tensor, async_op=False)
    
    # Unable to access tensors_gather or the buffer variables (e.g. self.queue_ptr, self.K) from here on
    print(type(self.queue_ptr), self.queue_ptr.shape)
    print("POST_{}_{}".format(torch.distributed.get_rank(),self.queue_ptr))
   

    output = torch.cat(tensors_gather, dim=0)
    return output

Output:

PRE_3_tensor([0], device='cuda:3')
<class 'torch.Tensor'> torch.Size([1])

PRE_0_tensor([0], device='cuda:0')
<class 'torch.Tensor'> torch.Size([1])

PRE_2_tensor([0], device='cuda:2')
<class 'torch.Tensor'> torch.Size([1])

PRE_1_tensor([0], device='cuda:1')
<class 'torch.Tensor'> torch.Size([1])

When I comment out the torch.distributed.all_gather(tensors_gather, tensor, async_op=False) call:

Output:

PRE_2_tensor([0], device='cuda:2')
<class 'torch.Tensor'> torch.Size([1])
POST_2_tensor([0], device='cuda:2')


PRE_0_tensor([0], device='cuda:0')
<class 'torch.Tensor'> torch.Size([1])
POST_0_tensor([0], device='cuda:0')

PRE_1_tensor([0], device='cuda:1')
<class 'torch.Tensor'> torch.Size([1])
POST_1_tensor([0], device='cuda:1')

PRE_3_tensor([0], device='cuda:3')
<class 'torch.Tensor'> torch.Size([1])
POST_3_tensor([0], device='cuda:3')


Also, there is no problem when only one process/GPU is used.

What might be the problem here? Is the implementation for gathering the tensor incorrect?

NCCL_DEBUG=INFO output:

cv03:25267:25267 [0] NCCL INFO Bootstrap : Using [0]enp1s0f0:128.59.8.153<0>
cv03:25267:25267 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
cv03:25267:25267 [0] NCCL INFO NET/IB : No device found.
cv03:25267:25267 [0] NCCL INFO NET/Socket : Using [0]enp1s0f0:128.59.8.153<0>
cv03:25267:25267 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.2
cv03:25271:25271 [2] NCCL INFO Bootstrap : Using [0]enp1s0f0:128.59.8.153<0>
cv03:25269:25269 [1] NCCL INFO Bootstrap : Using [0]enp1s0f0:128.59.8.153<0>
cv03:25272:25272 [3] NCCL INFO Bootstrap : Using [0]enp1s0f0:128.59.8.153<0>
cv03:25271:25271 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
cv03:25272:25272 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
cv03:25271:25271 [2] NCCL INFO NET/IB : No device found.
cv03:25271:25271 [2] NCCL INFO NET/Socket : Using [0]enp1s0f0:128.59.8.153<0>
cv03:25271:25271 [2] NCCL INFO Using network Socket
cv03:25269:25269 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
cv03:25269:25269 [1] NCCL INFO NET/IB : No device found.
cv03:25269:25269 [1] NCCL INFO NET/Socket : Using [0]enp1s0f0:128.59.8.153<0>
cv03:25269:25269 [1] NCCL INFO Using network Socket
cv03:25272:25272 [3] NCCL INFO NET/IB : No device found.
cv03:25272:25272 [3] NCCL INFO NET/Socket : Using [0]enp1s0f0:128.59.8.153<0>
cv03:25272:25272 [3] NCCL INFO Using network Socket
cv03:25271:25457 [2] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/64
cv03:25269:25458 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/64
cv03:25271:25457 [2] NCCL INFO Trees [0] 3/-1/-1->2->1|1->2->3/-1/-1 [1] 3/-1/-1->2->1|1->2->3/-1/-1
cv03:25267:25456 [0] NCCL INFO Channel 00/02 :    0   1   2   3
cv03:25269:25458 [1] NCCL INFO Trees [0] 2/-1/-1->1->0|0->1->2/-1/-1 [1] 2/-1/-1->1->0|0->1->2/-1/-1
cv03:25271:25457 [2] NCCL INFO Setting affinity for GPU 2 to 3ff003ff
cv03:25272:25460 [3] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/64
cv03:25267:25456 [0] NCCL INFO Channel 01/02 :    0   1   2   3
cv03:25269:25458 [1] NCCL INFO Setting affinity for GPU 1 to 3ff003ff
cv03:25272:25460 [3] NCCL INFO Trees [0] -1/-1/-1->3->2|2->3->-1/-1/-1 [1] -1/-1/-1->3->2|2->3->-1/-1/-1
cv03:25272:25460 [3] NCCL INFO Setting affinity for GPU 3 to 3ff003ff
cv03:25267:25456 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/64
cv03:25267:25456 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->-1|-1->0->1/-1/-1
cv03:25267:25456 [0] NCCL INFO Setting affinity for GPU 0 to 3ff003ff
cv03:25269:25458 [1] NCCL INFO Could not enable P2P between dev 1(=1b000) and dev 0(=1a000)
cv03:25271:25457 [2] NCCL INFO Could not enable P2P between dev 2(=60000) and dev 1(=1b000)
cv03:25267:25456 [0] NCCL INFO Could not enable P2P between dev 0(=1a000) and dev 3(=61000)
cv03:25272:25460 [3] NCCL INFO Could not enable P2P between dev 3(=61000) and dev 2(=60000)
cv03:25269:25458 [1] NCCL INFO Could not enable P2P between dev 1(=1b000) and dev 2(=60000)
cv03:25269:25458 [1] NCCL INFO Channel 00 : 1[1b000] -> 2[60000] via direct shared memory
cv03:25271:25457 [2] NCCL INFO Could not enable P2P between dev 2(=60000) and dev 3(=61000)
cv03:25267:25456 [0] NCCL INFO Could not enable P2P between dev 0(=1a000) and dev 1(=1b000)
cv03:25271:25457 [2] NCCL INFO Channel 00 : 2[60000] -> 3[61000] via direct shared memory
cv03:25267:25456 [0] NCCL INFO Channel 00 : 0[1a000] -> 1[1b000] via direct shared memory
cv03:25272:25460 [3] NCCL INFO Could not enable P2P between dev 3(=61000) and dev 0(=1a000)
cv03:25272:25460 [3] NCCL INFO Channel 00 : 3[61000] -> 0[1a000] via direct shared memory
cv03:25269:25458 [1] NCCL INFO Could not enable P2P between dev 1(=1b000) and dev 2(=60000)
cv03:25267:25456 [0] NCCL INFO Could not enable P2P between dev 0(=1a000) and dev 1(=1b000)
cv03:25271:25457 [2] NCCL INFO Could not enable P2P between dev 2(=60000) and dev 3(=61000)
cv03:25272:25460 [3] NCCL INFO Could not enable P2P between dev 3(=61000) and dev 2(=60000)
cv03:25272:25460 [3] NCCL INFO Channel 00 : 3[61000] -> 2[60000] via direct shared memory
cv03:25269:25458 [1] NCCL INFO Could not enable P2P between dev 1(=1b000) and dev 0(=1a000)
cv03:25269:25458 [1] NCCL INFO Channel 00 : 1[1b000] -> 0[1a000] via direct shared memory
cv03:25271:25457 [2] NCCL INFO Could not enable P2P between dev 2(=60000) and dev 1(=1b000)
cv03:25267:25456 [0] NCCL INFO Could not enable P2P between dev 0(=1a000) and dev 3(=61000)
cv03:25271:25457 [2] NCCL INFO Channel 00 : 2[60000] -> 1[1b000] via direct shared memory
cv03:25272:25460 [3] NCCL INFO Could not enable P2P between dev 3(=61000) and dev 2(=60000)
cv03:25269:25458 [1] NCCL INFO Could not enable P2P between dev 1(=1b000) and dev 0(=1a000)
cv03:25271:25457 [2] NCCL INFO Could not enable P2P between dev 2(=60000) and dev 1(=1b000)
cv03:25267:25456 [0] NCCL INFO Could not enable P2P between dev 0(=1a000) and dev 1(=1b000)
cv03:25267:25456 [0] NCCL INFO Channel 01 : 0[1a000] -> 1[1b000] via direct shared memory
cv03:25271:25457 [2] NCCL INFO Could not enable P2P between dev 2(=60000) and dev 3(=61000)
cv03:25272:25460 [3] NCCL INFO Could not enable P2P between dev 3(=61000) and dev 0(=1a000)
cv03:25269:25458 [1] NCCL INFO Could not enable P2P between dev 1(=1b000) and dev 2(=60000)
cv03:25271:25457 [2] NCCL INFO Channel 01 : 2[60000] -> 3[61000] via direct shared memory
cv03:25272:25460 [3] NCCL INFO Channel 01 : 3[61000] -> 0[1a000] via direct shared memory
cv03:25269:25458 [1] NCCL INFO Channel 01 : 1[1b000] -> 2[60000] via direct shared memory
cv03:25271:25457 [2] NCCL INFO Could not enable P2P between dev 2(=60000) and dev 3(=61000)
cv03:25267:25456 [0] NCCL INFO Could not enable P2P between dev 0(=1a000) and dev 1(=1b000)
cv03:25272:25460 [3] NCCL INFO Could not enable P2P between dev 3(=61000) and dev 2(=60000)
cv03:25269:25458 [1] NCCL INFO Could not enable P2P between dev 1(=1b000) and dev 2(=60000)
cv03:25272:25460 [3] NCCL INFO Channel 01 : 3[61000] -> 2[60000] via direct shared memory
cv03:25271:25457 [2] NCCL INFO Could not enable P2P between dev 2(=60000) and dev 1(=1b000)
cv03:25271:25457 [2] NCCL INFO Channel 01 : 2[60000] -> 1[1b000] via direct shared memory
cv03:25269:25458 [1] NCCL INFO Could not enable P2P between dev 1(=1b000) and dev 0(=1a000)
cv03:25272:25460 [3] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
cv03:25272:25460 [3] NCCL INFO comm 0x7fe118001060 rank 3 nranks 4 cudaDev 3 busId 61000 - Init COMPLETE
cv03:25269:25458 [1] NCCL INFO Channel 01 : 1[1b000] -> 0[1a000] via direct shared memory
cv03:25267:25456 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
cv03:25267:25456 [0] NCCL INFO comm 0x7f144c001060 rank 0 nranks 4 cudaDev 0 busId 1a000 - Init COMPLETE
cv03:25267:25267 [0] NCCL INFO Launch mode Parallel
cv03:25271:25457 [2] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
cv03:25271:25457 [2] NCCL INFO comm 0x7fd448001060 rank 2 nranks 4 cudaDev 2 busId 60000 - Init COMPLETE
cv03:25269:25458 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
cv03:25269:25458 [1] NCCL INFO comm 0x7f58d4001060 rank 1 nranks 4 cudaDev 1 busId 1b000 - Init COMPLETE
native distributed, size: 4, rank: 3, local rank: 3
native distributed, size: 4, rank: 2, local rank: 2
native distributed, size: 4, rank: 1, local rank: 1
native distributed, size: 4, rank: 0, local rank: 0

This is how the distributed group is initialized:

distributed.init_process_group(
                backend='nccl',
                init_method='tcp://{}:{}'.format(master_address, master_port),
                world_size=world_size,
                rank=rank,
                group_name='mtorch')

The all_gather call here does look okay to me. Could you share a minimal repro script that we can try on our end (for example, something along the lines of the sketch after this list)? A few possibilities that could lead to a freeze:

  1. The order of all_gather calls is not the same on all ranks.
  2. There could be some CUDA synchronization causing a deadlock with all_gather (e.g., all_gather on one rank is waiting for another rank, while that rank is itself blocked on a CUDA synchronize).
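
A minimal standalone all_gather test could look roughly like the sketch below (this is only a sketch, assuming a single node with one process per GPU; the worker function name, master port, and tensor shape are placeholders, not taken from your code):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    # Bind each process to its own GPU before any NCCL collective;
    # leaving every rank on the default device can make collectives hang.
    torch.cuda.set_device(rank)
    dist.init_process_group(backend="nccl", world_size=world_size, rank=rank)

    # Each rank contributes a tensor filled with its own rank id.
    tensor = torch.full((2, 768), float(rank), device="cuda")
    tensors_gather = [torch.ones_like(tensor) for _ in range(world_size)]
    dist.all_gather(tensors_gather, tensor, async_op=False)

    output = torch.cat(tensors_gather, dim=0)
    print("rank {}: gathered shape {}".format(rank, output.shape))
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)

If something like this runs cleanly on the same machine, the hang is more likely caused by how the collective is reached in your training loop (e.g., ranks entering all_gather in different orders) than by all_gather itself.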

There is a stray colon at the end of the line that builds tensors_gather (the list comprehension right before the all_gather call); could it be the cause?
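
For reference, without the trailing colon that line would read:

tensors_gather = [torch.ones_like(tensor) for _ in range(torch.distributed.get_world_size())]

With the colon in place the snippet would not even parse (it is a SyntaxError), so it is presumably a copy-paste artifact in the post rather than the code that actually runs.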

Have you been able to resolve this issue?