When I run my code on multiple GPUs (via nn.DataParallel), it occasionally crashes with the following error:
  File "main.py", line 132, in train
    model.train(train_loader, val_loader)
  File "/mnt/DATA/code/bitbucket/drn_seg/segment/seg_model.py", line 54, in train
    self.train_epoch(epoch, train_loader, val_loader)
  File "/mnt/DATA/code/bitbucket/drn_seg/segment/seg_model.py", line 93, in train_epoch
    output = self.model_seg(image)[0]
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/data_parallel.py", line 113, in forward
    replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/data_parallel.py", line 118, in replicate
    return replicate(module, device_ids)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/replicate.py", line 12, in replicate
    param_copies = Broadcast.apply(devices, *params)
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/parallel/_functions.py", line 17, in forward
    outputs = comm.broadcast_coalesced(inputs, ctx.target_gpus)
  File "/usr/local/lib/python3.5/dist-packages/torch/cuda/comm.py", line 40, in broadcast_coalesced
    return torch._C._broadcast_coalesced(tensors, devices, buffer_size)
RuntimeError: NCCL Error 2: unhandled system error
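The failure happens inside NCCL's broadcast of the model parameters to the other GPUs, not in my own code, so I suspect an environment or inter-GPU communication problem. To gather more information I plan to enable NCCL's debug logging before launching training. A minimal sketch, assuming the script is normally started as `python main.py` (the environment variable names are standard NCCL settings, not something from my code):

```shell
# Ask NCCL to print detailed diagnostics to stderr on the next run.
export NCCL_DEBUG=INFO
# Optionally narrow the output to the initialization and networking subsystems.
export NCCL_DEBUG_SUBSYS=INIT,NET
# Then launch the training script as usual, e.g.:
#   python main.py
```

With this set, the run that hits "unhandled system error" should log which transport or system call actually failed.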