nn.DataParallel(model).cuda() hangs

Hi,

If I move my network to the GPU with
model.cuda()
everything is OK. The model is big, so it consumes 91% of GPU memory.

If I instead use
model = nn.DataParallel(model).cuda()
it seems to make progress at first, but soon it hangs. When I press Ctrl-C, I always get a traceback like the following:

Traceback (most recent call last):
  File "/home/polphit/anaconda3/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/home/polphit/anaconda3/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/polphit/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/polphit/anaconda3/lib/python3.6/multiprocessing/queues.py", line 343, in get
    res = self._reader.recv_bytes()
  File "/home/polphit/anaconda3/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/polphit/anaconda3/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/polphit/anaconda3/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
KeyboardInterrupt

I tried on two different machines and got the same issue.

  • Ubuntu 16.04
  • conda 4.3.14
  • PyTorch installed from source
  • Python 3.6.0.final.0
  • requests 2.12.4
  • CUDA 8.0
  • cuDNN 5.1

When I run the same code on a machine without conda (using the system python3), it works well.

Can anyone give me a clue on how to resolve this issue?
Thank you.


That’s a stack trace of a data loader worker process. Can you paste the full error into a gist and link it here?

Oh, that stack trace is all I could get, since it just hangs without printing any error.
I guess it’s some kind of synchronization issue.
I have four networks, netA, netB, netC, and netD, which are set up as follows:

netA = nn.DataParallel(netA).cuda()
netB = nn.DataParallel(netB).cuda()
netC = netC.cuda(0)
netD = netD.cuda(1)

(I have two GPU devices)

The data flow is:

i (input) -> netA ---> netB -> x (output #1)
                   +-> netC -> y (output #2)
                   +-> netD -> z (output #3)
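
In code, one step looks roughly like this (a simplified sketch: feat is just a placeholder for netA’s output, i is a batch already on GPU 0, and the loss/backward parts are omitted):

feat = netA(i)           # DataParallel: input scattered over both GPUs, output gathered on GPU 0
x = netB(feat)           # DataParallel again
y = netC(feat)           # plain module pinned to GPU 0
z = netD(feat.cuda(1))   # plain module pinned to GPU 1, so the input is copied there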

If this is not enough to guess the cause, I can simplify my code to reproduce the issue with a minimal data upload.

Oh, when I add

torch.cuda.synchronize()

at the end of each batch, one machine works properly, but the other machine still has the same issue.
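
Concretely, I just call it once per iteration, after the optimizer step (a trimmed sketch; loader is a placeholder and everything else in the loop is unchanged):

for batch in loader:
    # ... forward through netA/netB/netC/netD, loss, backward, optimizer.step() ...
    torch.cuda.synchronize()   # block until all queued GPU work (including NCCL calls) has finished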

Oh yeah, this can happen. It’s because nn.DataParallel uses NVIDIA’s NCCL library, and it simply deadlocks if you happen to issue two calls at the same time… I guess we’ll need to add some mutexes.


Unfortunately, even if we add those locks, two processes that use the same GPUs in DataParallel will still deadlock too…

So… is it a bug in PyTorch? I ran into the same issue.

No, it’s a bug in NCCL (NVIDIA’s library). But you probably shouldn’t be using the same GPU in multiple data parallel jobs anyway.
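
If you do need several jobs on one machine, the simplest option is to give each process its own disjoint set of GPUs, e.g. by passing explicit device_ids (a sketch, assuming a 4-GPU box; model stands for whatever module you are wrapping):

import torch.nn as nn

# process 1: data parallel over GPUs 0 and 1 only
model = nn.DataParallel(model, device_ids=[0, 1]).cuda(0)

# process 2 (run separately): data parallel over GPUs 2 and 3 only
model = nn.DataParallel(model, device_ids=[2, 3]).cuda(2)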

I have a similar problem: the process hangs if I use DataParallel on 2 K80 GPUs. Do you know what the issue might be, @apaszke? If I restrict it to a single GPU, everything works fine.

Hi everyone, NVIDIA’s @ngimel has investigated this problem, and the hangs might not be related to PyTorch. She has written a detailed comment here on how to figure out the issue and work around it:

Please have a look and see if it applies to you.
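
If you want to check this on your own setup, NCCL’s standard environment variables can help: NCCL_DEBUG for verbose logging, and (on newer NCCL versions) NCCL_P2P_DISABLE to rule out peer-to-peer transfers. A minimal sketch; both must be set before CUDA/NCCL is initialized:

import os
os.environ["NCCL_DEBUG"] = "INFO"       # print what NCCL is doing
os.environ["NCCL_P2P_DISABLE"] = "1"    # quick test: skip P2P and fall back to staged copies

import torch
import torch.nn as nn
# build the model as usual, then:
# model = nn.DataParallel(model).cuda()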


Hi! I am facing a similar problem with Titan X GPUs on PyTorch 0.4. I am running inside a Docker container with 12 GB of shared memory allocated. When I use nn.DataParallel I get:
RuntimeError: NCCL error 1, unhandled cuda error.
I tried the iommu disable option and have the latest NCCL 2 library installed.
I tried both conda and pip installs as well, but they give the same NCCL error 1. Sometimes the code deadlocks and the GPUs show 100% utilization.
The p2p bandwidth/latency test passes.
Any help would be appreciated.
Thanks

It turned out to be a hardware issue. Although the p2p latency test passed, when I changed CUDA_VISIBLE_DEVICES and ran the code without DataParallel, it gave an illegal memory access error on the faulty GPU. I replaced the GPU and now the code works fine.