nn.DataParallel(model).cuda() hangs

Hi,

If I move my network to the GPU with
model.cuda()
everything is OK. The model is big, so it consumes 91% of GPU memory.

If I instead use
model = nn.DataParallel(model).cuda()
it seems to make progress at first, but soon it hangs. When I press Ctrl-C, I always get a traceback like the following:

Traceback (most recent call last):
  File "/home/polphit/anaconda3/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/polphit/anaconda3/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/polphit/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/polphit/anaconda3/lib/python3.6/multiprocessing/queues.py", line 343, in get
    res = self._reader.recv_bytes()
  File "/home/polphit/anaconda3/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/polphit/anaconda3/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/polphit/anaconda3/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
KeyboardInterrupt

I tried on two different machines and got the same issue.

  • Ubuntu 16.04
  • conda 4.3.14
  • PyTorch installed from source
  • Python 3.6.0.final.0
  • requests 2.12.4
  • CUDA 8.0
  • cuDNN 5.1

When I run the same code on a machine without conda (using the system python3), it works well.

Can anyone give me a clue on how to resolve this issue?
Thank you.


That’s a stack trace of a data loader worker process. Can you paste the full error into a gist and link it here?

Oh, that stack trace is all I could get, since it just hangs without printing any error.
I guess it’s some kind of synchronization issue.
I have four networks, netA, netB, netC, and netD, which are set up as follows:

netA = nn.DataParallel(netA).cuda()
netB = nn.DataParallel(netB).cuda()
netC = netC.cuda(0)
netD = netD.cuda(1)

(I have two GPU devices)

The data flow is:

i (input) -> netA ---> netB -> x (output #1)
                   +-> netC -> y (output #2)
                   +-> netD -> z (output #3)
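
In code, one step looks roughly like this (a simplified sketch: feat is just a placeholder for netA’s output, i is a batch already on GPU 0, and the loss/backward parts are omitted):

feat = netA(i)           # DataParallel: input scattered over both GPUs, output gathered on GPU 0
x = netB(feat)           # DataParallel again
y = netC(feat)           # plain module pinned to GPU 0
z = netD(feat.cuda(1))   # plain module pinned to GPU 1, so the input is copied there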

If this is not enough to guess the cause, I can simplify my code to reproduce the issue with a minimal data upload.

Oh, when I add

torch.cuda.synchronize()

at the end of each batch, one machine works properly, but the other machine still has the same issue.
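
Concretely, I just call it once per iteration, after the optimizer step (a trimmed sketch; loader is a placeholder and everything else in the loop is unchanged):

for batch in loader:
    # ... forward through netA/netB/netC/netD, loss, backward, optimizer.step() ...
    torch.cuda.synchronize()   # block until all queued GPU work (including NCCL calls) has finished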

Oh yeah, this can happen. It’s because nn.DataParallel uses NVIDIA’s NCCL library, and it simply deadlocks if you happen to issue two calls at the same time… I guess we’ll need to add some mutexes.


Unfortunately, even if we add those locks, two processes that use the same GPUs in DataParallel will still deadlock too…

So… is it a bug in PyTorch? I ran into the same issue.

No, it’s a bug in NCCL (NVIDIA’s library). But you probably shouldn’t be using the same GPU in multiple data parallel jobs anyway.
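
If you do need several jobs on one machine, the simplest option is to give each process its own disjoint set of GPUs, e.g. by passing explicit device_ids (a sketch, assuming a 4-GPU box; model stands for whatever module you are wrapping):

import torch.nn as nn

# process 1: data parallel over GPUs 0 and 1 only
model = nn.DataParallel(model, device_ids=[0, 1]).cuda(0)

# process 2 (run separately): data parallel over GPUs 2 and 3 only
model = nn.DataParallel(model, device_ids=[2, 3]).cuda(2)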

I have a similar problem: the process hangs if I use DataParallel on 2 K80 GPUs. Do you know what the issue might be, @apaszke? If I restrict it to a single GPU, everything works fine.

Hi everyone, NVIDIA’s @ngimel has investigated this problem, and the hangs might not be related to PyTorch. She has written a detailed comment here on how to figure out the issue and work around it:

Please have a look and see if it applies to you.
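
If you want to check this on your own setup, NCCL’s standard environment variables can help: NCCL_DEBUG for verbose logging, and (on newer NCCL versions) NCCL_P2P_DISABLE to rule out peer-to-peer transfers. A minimal sketch; both must be set before CUDA/NCCL is initialized:

import os
os.environ["NCCL_DEBUG"] = "INFO"       # print what NCCL is doing
os.environ["NCCL_P2P_DISABLE"] = "1"    # quick test: skip P2P and fall back to staged copies

import torch
import torch.nn as nn
# build the model as usual, then:
# model = nn.DataParallel(model).cuda()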


Hi! I am facing a similar problem with Titan X GPUs on PyTorch 0.4. I am running inside a Docker container with 12 GB of shared memory allocated. When I use nn.DataParallel I get:
RuntimeError: NCCL error 1, unhandled cuda error.
I tried the iommu disable option and have the latest NCCL 2 library installed.
I tried both conda and pip installs as well, but they give the same NCCL error 1. Sometimes the code deadlocks and the GPUs show 100% utilization.
The p2p bandwidth/latency test passes.
Any help would be appreciated.
Thanks

It turned out to be a hardware issue. Although the p2p latency test passed, when I changed CUDA_VISIBLE_DEVICES and ran the code without DataParallel, it gave an illegal memory access error on the faulty GPU. I replaced the GPU and now the code works fine.