Trouble with multiple GPU setup - thread lock

Hi all,

I have spent the past day trying to figure out how to use multiple GPUs. In theory, parallelizing a model across multiple GPUs is supposed to be as easy as wrapping it in nn.DataParallel, but that has not worked for me. As the simplest, most canonical reproduction I could find, I ran the code from the Data Parallelism tutorial, line for line.
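
For reference, this is roughly what I ran, condensed from the tutorial (the RandomDataset and Model definitions are my shortened sketch, not a verbatim copy of the tutorial; the sizes are arbitrary):

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# Condensed version of the tutorial: a random dataset and a single-Linear model.
class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

class Model(nn.Module):
    def __init__(self, input_size, output_size):
        super(Model, self).__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, input):
        output = self.fc(input)
        print("\tIn Model: input size", input.size(), "output size", output.size())
        return output

input_size, output_size = 5, 2
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
rand_loader = DataLoader(dataset=RandomDataset(input_size, 100), batch_size=30, shuffle=True)

model = Model(input_size, output_size)
if torch.cuda.device_count() > 1:
    # nn.DataParallel replicates the module and scatters each batch across all visible GPUs
    model = nn.DataParallel(model)
model.to(device)

for data in rand_loader:
    input = data.to(device)
    output = model(input)  # <- this is the call that hangs
    print("Outside: input size", input.size(), "output_size", output.size())

The output is as follows; it is the same output I get every time I try to run PyTorch with multiple GPUs: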

---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
<ipython-input-3-0f0d83e9ef13> in <module>
      1 for data in rand_loader:
      2     input = data.to(device)
----> 3     output = model(input)
      4     print("Outside: input size", input.size(),
      5           "output_size", output.size())

/usr/local/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    487             result = self._slow_forward(*input, **kwargs)
    488         else:
--> 489             result = self.forward(*input, **kwargs)
    490         for hook in self._forward_hooks.values():
    491             hook_result = hook(self, input, result)

/usr/local/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py in forward(self, *inputs, **kwargs)
    141             return self.module(*inputs[0], **kwargs[0])
    142         replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
--> 143         outputs = self.parallel_apply(replicas, inputs, kwargs)
    144         return self.gather(outputs, self.output_device)
    145 

/usr/local/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py in parallel_apply(self, replicas, inputs, kwargs)
    151 
    152     def parallel_apply(self, replicas, inputs, kwargs):
--> 153         return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
    154 
    155     def gather(self, outputs, output_device):

/usr/local/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py in parallel_apply(modules, inputs, kwargs_tup, devices)
     73             thread.start()
     74         for thread in threads:
---> 75             thread.join()
     76     else:
     77         _worker(0, modules[0], inputs[0], kwargs_tup[0], devices[0])

/usr/local/lib/python3.6/threading.py in join(self, timeout)
   1054 
   1055         if timeout is None:
-> 1056             self._wait_for_tstate_lock()
   1057         else:
   1058             # the behavior of a negative timeout isn't documented, but

/usr/local/lib/python3.6/threading.py in _wait_for_tstate_lock(self, block, timeout)
   1070         if lock is None:  # already determined that the C code is done
   1071             assert self._is_stopped
-> 1072         elif lock.acquire(block, timeout):
   1073             lock.release()
   1074             self._stop()

KeyboardInterrupt: 

Note that it hangs indefinitely; the KeyboardInterrupt is me stopping it manually. The behavior is the same every time: the worker threads appear to enter some sort of deadlock, although I do not understand how or why.

Some information about my system:
Operating System: Ubuntu 16.04
GPUs: 4x GTX 1080 Ti
PyTorch version: 1.0.1
CUDA version: 10.0
NVIDIA driver: 415

I have tried everything from making only specific subsets of my GPUs visible to CUDA to reinstalling everything CUDA-related, but I can't figure out why I cannot run on multiple GPUs.
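
For what it's worth, restricting GPU visibility looked roughly like this (the device IDs are just an example; the variable has to be set before CUDA is initialized):

import os

# Example only: expose two of the four cards to CUDA (IDs are arbitrary here).
# This must be set before CUDA is initialized, i.e. before the first CUDA call.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import torch
print(torch.cuda.device_count())  # should report 2 with the setting above

If anyone could point me in the right direction, it would be greatly appreciated.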

Was this ever resolved? I’m facing the same problem.

Which version of PyTorch are you using? Can you share a self-contained repro?

@ptrblck was this issue ever solved? I have seen a lot of threads on the PyTorch forums about NCCL deadlocks, but I haven't found a solution.

I was not able to solve this issue, and my rig is currently disassembled and across the country, so unfortunately I can't be of much help.

It seems this issue was not solved, and we don't have a code snippet to reproduce it.
I would generally recommend using the latest stable release (and trying the nightly, if possible) with the latest CUDA, NCCL, etc. versions. If the error is still observable, an executable code snippet that reproduces it would be very helpful.
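
The versions actually being used by the current build can be checked via e.g.:

import torch

# Print the versions the installed PyTorch build is actually using
print(torch.__version__)               # PyTorch version
print(torch.version.cuda)              # CUDA version PyTorch was built with
print(torch.backends.cudnn.version())  # cuDNN version
print(torch.cuda.nccl.version())       # NCCL version shipped with this build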

I tried the pytorch-nightly build, which uses NCCL 2.7.6, and I have not hit the deadlock again so far. Thanks @ptrblck.

@Jeffrey_Wang you might want to try that out.

@ptrblck what could be the issue with the previous version?

Nothing we are aware of, i.e. we haven’t seen deadlocks in NCCL 2.4 before.