NCCL Error 2 when training with 2 GPUs

Hi all,

I am training a model on 2 RTX 3090 GPUs. The driver version is 455.32.00, CUDA is 11.1, and torch.cuda.nccl.version() yields 2708.
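
For reference, this is roughly how I am reading the PyTorch-side numbers (a quick sketch; exact outputs may of course differ on another setup):

    import torch

    # How I'm checking the versions from within PyTorch
    print(torch.__version__)             # installed PyTorch build
    print(torch.version.cuda)            # CUDA version PyTorch was built against
    print(torch.cuda.nccl.version())     # yields 2708, i.e. NCCL 2.7.8
    print(torch.cuda.device_count())     # 2
    for i in range(torch.cuda.device_count()):
        print(torch.cuda.get_device_name(i))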

To enable multi-GPU, I have something like this:

    def to(self, device, available_devices):
        if available_devices > 1:
            self.net = torch.nn.DataParallel(self.net).to(device)
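
(The wrapped self.net is then called directly as self.net(inputs) inside the training loop, which is the nn.py / _iter_fit frame at the top of the trace below.)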

However, when training I get the following stack trace:

File "nn.py", line 641, in _iter_fit                                                                                                                                                                              
    outputs = self.net(inputs)                                                                                                                                                                                                                                                     
  File "torch/nn/modules/module.py", line 727, in _call_impl                                                                                                                                                                         
    result = self.forward(*input, **kwargs)                                                                                                                                                                                                                                        
  File "helpers/default_net.py", line 124, in forward                                                                                                                                                               
    output = self._forward_net(input)                                                                                                                                                                                                                                               
  File "torch/nn/modules/module.py", line 727, in _call_impl                                                                                                                                                                         
    result = self.forward(*input, **kwargs)                                                                                                                                                                                                                                        
  File "torch/nn/parallel/data_parallel.py", line 160, in forward                                                                                                                                                                    
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])                                                                                                                                                                                                          
  File "torch/nn/parallel/data_parallel.py", line 165, in replicate                                                                                                                                                                  
    return replicate(module, device_ids, not torch.is_grad_enabled())                                                                                                                                                                                                              
  File "torch/nn/parallel/replicate.py", line 88, in replicate                                                                                                                                                                       
    param_copies = _broadcast_coalesced_reshape(params, devices, detach)                                                                                                                                                                                                           
  File "torch/nn/parallel/replicate.py", line 71, in _broadcast_coalesced_reshape                                                                                                                                                    
    tensor_copies = Broadcast.apply(devices, *tensors)                                                                                                                                                                                                                             
  File "torch/nn/parallel/_functions.py", line 22, in forward                                                                                                                                                                        
    outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)                                                                                                                                                                                                                    
  File "torch/nn/parallel/comm.py", line 56, in broadcast_coalesced                        
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)                                                                                                                                                                                                            
RuntimeError: NCCL Error 2: unhandled system error 

With this additional info dump from NCCL:

78244:78244 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
78244:78244 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
78244:78244 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.0
78244:78465 [0] NCCL INFO Call to connect returned Connection timed out, retrying
78244:78466 [1] NCCL INFO Call to connect returned Connection timed out, retrying
78244:78465 [0] NCCL INFO Call to connect returned Connection timed out, retrying
78244:78466 [1] NCCL INFO Call to connect returned Connection timed out, retrying

78244:78465 [0] include/socket.h:403 NCCL WARN Connect failed : Connection timed out

78244:78466 [1] include/socket.h:403 NCCL WARN Connect failed : Connection timed out
78244:78466 [1] NCCL INFO bootstrap.cc:95 -> 2
78244:78466 [1] NCCL INFO bootstrap.cc:309 -> 2
78244:78466 [1] NCCL INFO init.cc:555 -> 2
78244:78466 [1] NCCL INFO init.cc:840 -> 2
78244:78465 [0] NCCL INFO bootstrap.cc:95 -> 2
78244:78466 [1] NCCL INFO group.cc:73 -> 2 [Async thread]
78244:78465 [0] NCCL INFO bootstrap.cc:309 -> 2
78244:78465 [0] NCCL INFO init.cc:555 -> 2
78244:78465 [0] NCCL INFO init.cc:840 -> 2
78244:78465 [0] NCCL INFO group.cc:73 -> 2 [Async thread]
78244:78244 [0] NCCL INFO init.cc:906 -> 2
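
For what it's worth, I think even a minimal script like the one below should hit the same replicate/broadcast_coalesced path that fails in the trace above (this is a hypothetical stripped-down sketch, not my actual model):

    import torch
    import torch.nn as nn

    # Tiny stand-in module; wrapping it in DataParallel makes the forward pass
    # replicate it onto both GPUs via the same broadcast_coalesced call that
    # raises NCCL Error 2 in my training run.
    net = nn.DataParallel(nn.Linear(16, 16), device_ids=[0, 1]).to("cuda:0")

    inputs = torch.randn(8, 16, device="cuda:0")
    outputs = net(inputs)
    print(outputs.shape)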

Any ideas on what might be going wrong? Any help would be much appreciated.

Cross-posted from here.