Hi all,
I am training a model with two RTX 3090 GPUs. The driver is 455.32.00, the CUDA version is 11.1, and torch.cuda.nccl.version() returns 2708.
To enable multi-GPU training, I have something like this:
def to(self, device, available_devices):
    if available_devices > 1:
        self.net = torch.nn.DataParallel(self.net).to(device)
However, when training I get the following stack trace:
  File "nn.py", line 641, in _iter_fit
    outputs = self.net(inputs)
  File "torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "helpers/default_net.py", line 124, in forward
    output = self._forward_net(input)
  File "torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "torch/nn/parallel/data_parallel.py", line 160, in forward
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
  File "torch/nn/parallel/data_parallel.py", line 165, in replicate
    return replicate(module, device_ids, not torch.is_grad_enabled())
  File "torch/nn/parallel/replicate.py", line 88, in replicate
    param_copies = _broadcast_coalesced_reshape(params, devices, detach)
  File "torch/nn/parallel/replicate.py", line 71, in _broadcast_coalesced_reshape
    tensor_copies = Broadcast.apply(devices, *tensors)
  File "torch/nn/parallel/_functions.py", line 22, in forward
    outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
  File "torch/nn/parallel/comm.py", line 56, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: NCCL Error 2: unhandled system error
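For anyone trying to reproduce this: NCCL prints its INFO and WARN diagnostics when the NCCL_DEBUG environment variable is set to INFO before the process first uses NCCL. Below is a minimal sketch of rerunning a script with that enabled. The helper names nccl_debug_env and run_with_nccl_debug are hypothetical, and the optional NCCL_SOCKET_IFNAME pin is only an assumption that may be worth testing when a socket connect times out, not a confirmed fix for this setup.

```python
import os
import subprocess
import sys


def nccl_debug_env(base_env=None, socket_ifname=None):
    """Return a copy of the environment with NCCL diagnostics switched on."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"  # makes NCCL print INFO and WARN lines
    if socket_ifname is not None:
        # Assumption to test: restrict NCCL's socket traffic to one interface.
        env["NCCL_SOCKET_IFNAME"] = socket_ifname
    return env


def run_with_nccl_debug(script, socket_ifname=None):
    """Run a Python script (path supplied by the caller) in a child process."""
    env = nccl_debug_env(socket_ifname=socket_ifname)
    return subprocess.run([sys.executable, script], env=env, check=False)
```

For example, run_with_nccl_debug("train.py", socket_ifname="lo") would rerun a (hypothetical) train.py with the loopback interface pinned. Using a child process keeps the variables out of the parent shell and guarantees they are set before torch is imported.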
With this additional info dump from NCCL:
78244:78244 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
78244:78244 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
78244:78244 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.0
78244:78465 [0] NCCL INFO Call to connect returned Connection timed out, retrying
78244:78466 [1] NCCL INFO Call to connect returned Connection timed out, retrying
78244:78465 [0] NCCL INFO Call to connect returned Connection timed out, retrying
78244:78466 [1] NCCL INFO Call to connect returned Connection timed out, retrying
78244:78465 [0] include/socket.h:403 NCCL WARN Connect failed : Connection timed out
78244:78466 [1] include/socket.h:403 NCCL WARN Connect failed : Connection timed out
78244:78466 [1] NCCL INFO bootstrap.cc:95 -> 2
78244:78466 [1] NCCL INFO bootstrap.cc:309 -> 2
78244:78466 [1] NCCL INFO init.cc:555 -> 2
78244:78466 [1] NCCL INFO init.cc:840 -> 2
78244:78465 [0] NCCL INFO bootstrap.cc:95 -> 2
78244:78466 [1] NCCL INFO group.cc:73 -> 2 [Async thread]
78244:78465 [0] NCCL INFO bootstrap.cc:309 -> 2
78244:78465 [0] NCCL INFO init.cc:555 -> 2
78244:78465 [0] NCCL INFO init.cc:840 -> 2
78244:78465 [0] NCCL INFO group.cc:73 -> 2 [Async thread]
78244:78244 [0] NCCL INFO init.cc:906 -> 2
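To make the dump easier to read, here is a small sketch that groups the WARN messages and the "file:line -> code" error-propagation entries by thread and GPU rank. The regular expressions assume the "pid:tid [rank] message" prefix seen in the lines above, and the function name summarize_nccl_log is my own; this only reorganizes the log and does not diagnose the cause.

```python
import re
from collections import defaultdict

# Assumed line shape, taken from the dump above: "<pid>:<tid> [<rank>] <message>"
LINE_RE = re.compile(r"^(\d+):(\d+) \[(\d+)\] (.*)$")
# Error-propagation entries look like "NCCL INFO bootstrap.cc:95 -> 2"
TRAIL_RE = re.compile(r"NCCL INFO (\S+:\d+) -> (\d+)")


def summarize_nccl_log(text):
    """Group NCCL warnings and error-propagation entries by (thread id, rank)."""
    summary = defaultdict(lambda: {"warnings": [], "trail": []})
    for raw in text.splitlines():
        match = LINE_RE.match(raw.strip())
        if match is None:
            continue  # e.g. the bare "NCCL version ..." banner line
        _pid, tid, rank, message = match.groups()
        key = (int(tid), int(rank))
        if "NCCL WARN" in message:
            warning = message.split("NCCL WARN", 1)[1].strip()
            summary[key]["warnings"].append(warning)
        else:
            trail = TRAIL_RE.search(message)
            if trail:
                summary[key]["trail"].append(trail.group(1))
    return dict(summary)


# A few lines copied from the dump above, used as sample input.
SAMPLE = """\
78244:78465 [0] include/socket.h:403 NCCL WARN Connect failed : Connection timed out
78244:78466 [1] include/socket.h:403 NCCL WARN Connect failed : Connection timed out
78244:78466 [1] NCCL INFO bootstrap.cc:95 -> 2
78244:78466 [1] NCCL INFO bootstrap.cc:309 -> 2
78244:78465 [0] NCCL INFO bootstrap.cc:95 -> 2
"""

if __name__ == "__main__":
    for (tid, rank), info in sorted(summarize_nccl_log(SAMPLE).items()):
        print(f"thread {tid} rank {rank}: {info['warnings']} via {info['trail']}")
```

Read this way, both ranks fail at the same point: the socket connect in bootstrap.cc times out, and the error code 2 then propagates up through init.cc and group.cc.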
Any ideas on what might be happening? Feedback will be deeply appreciated.