nn.DataParallel(model).cuda() hangs

(Pete Tae-hoon Kim) #1


If I move my network to the GPU with

model = model.cuda()

everything is OK. The model is big, so it consumes 91% of video memory.

If I instead use

model = nn.DataParallel(model).cuda()

then it seems to make progress at first, but soon it hangs. When I press CTRL-C, I always get a stack trace like this:

Traceback (most recent call last):
  File "/home/polphit/anaconda3/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
  File "/home/polphit/anaconda3/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/polphit/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 28, in _worker_loop
    r = index_queue.get()
  File "/home/polphit/anaconda3/lib/python3.6/multiprocessing/queues.py", line 343, in get
    res = self._reader.recv_bytes()
  File "/home/polphit/anaconda3/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/polphit/anaconda3/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/home/polphit/anaconda3/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)

I tried on two different machines and got the same issue.

  • Ubuntu 16.04,
  • conda 4.3.14,
  • pytorch installed from source,
  • python 3.6.0.final.0
  • requests 2.12.4
  • CUDA 8.0
  • cuDNN 5.1

When I run the same code on a machine without conda (plain python3), it works well.

Can anyone give me a clue about how to resolve this issue?
Thank you.

(Adam Paszke) #2

That’s a stack trace of a data loader process; can you paste the full error into a gist and link it here?

(Pete Tae-hoon Kim) #3

Oh, that stack trace is all I could get, since the process just hangs without raising an error.
I guess it’s some kind of synchronization issue.
I have four networks netA, netB, netC, netD, which were

netA = nn.DataParallel(netA).cuda()
netB = nn.DataParallel(netB).cuda()
netC = netC.cuda(0)
netD = netD.cuda(1)

(I have two GPU devices)

Flow is

i (input) -> netA ---> netB -> x (output #1)
                   +-> netC -> y (output #2)
                   +-> netD -> z (output #3)

If this is not enough to guess the cause, I can simplify my code to reproduce the issue with a minimal data upload.
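In code, the forward pass looks roughly like this (a simplified sketch; the tensor names are placeholders):

feat = netA(i)          # data-parallel over both GPUs
x = netB(feat)          # data-parallel over both GPUs
y = netC(feat.cuda(0))  # single-GPU module on device 0
z = netD(feat.cuda(1))  # single-GPU module on device 1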

(Pete Tae-hoon Kim) #4

Oh, when I add

torch.cuda.synchronize()

at the end of each batch, one machine works properly, although the other machine still has the same issue.
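To show where the call sits, the training loop looks roughly like this (a minimal sketch with hypothetical stand-in networks, not my real code):

import torch
import torch.nn as nn

netA = nn.DataParallel(nn.Linear(128, 128)).cuda()
netB = nn.DataParallel(nn.Linear(128, 10)).cuda()
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(
    list(netA.parameters()) + list(netB.parameters()), lr=0.01)

for _ in range(100):  # stand-in for iterating over a DataLoader
    inputs = torch.randn(64, 128).cuda()
    target = torch.randn(64, 10).cuda()

    loss = criterion(netB(netA(inputs)), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Block until all queued kernels (including the NCCL collectives issued
    # by the DataParallel modules) finish before the next batch starts.
    torch.cuda.synchronize()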

(Adam Paszke) #5

Oh yeah, this will happen. It’s because nn.DataParallel uses NVIDIA’s NCCL library, and NCCL just deadlocks if you happen to issue two calls at the same time… I guess we’ll need to add some mutexes.
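To make the failure mode concrete, something like this (a hypothetical repro sketch, not code from this thread) issues collectives from two DataParallel modules on the same pair of GPUs at the same time, which is exactly the situation that can deadlock:

import threading
import torch
import torch.nn as nn

# Two independent data-parallel modules sharing the same GPUs.
netA = nn.DataParallel(nn.Linear(512, 512)).cuda()
netB = nn.DataParallel(nn.Linear(512, 512)).cuda()

def forward(net, x):
    # Each call broadcasts parameters over NCCL; if collectives from the
    # two threads interleave on the same devices, NCCL can hang.
    net(x)

x = torch.randn(64, 512).cuda()
t1 = threading.Thread(target=forward, args=(netA, x))
t2 = threading.Thread(target=forward, args=(netB, x))
t1.start(); t2.start()
t1.join(); t2.join()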

(Adam Paszke) #6

Unfortunately, even if we add those locks, doing that in two processes that use the same GPUs in DataParallel will deadlock too…

(Jin Ma) #7

So… is it a bug in pytorch? I’m running into the same issue.

(Adam Paszke) #8

No, it’s a bug in NCCL (NVIDIA’s library). But you probably shouldn’t be using the same GPU in multiple data parallel jobs anyway.
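One way to follow that advice is to pin each job to its own device before CUDA is initialized, e.g. (a sketch, assuming a two-GPU box with one job per GPU):

import os

# Pin this process to one GPU so concurrent jobs never share a device;
# use "1" in the second job. Must be set before torch initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
print(torch.cuda.device_count())  # this process now sees exactly one GPU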


I have a similar problem: the process hangs if I use DataParallel on 2 K80 GPUs. Do you know what the issue might be, @apaszke? If I restrict it to one GPU, everything works fine.


Hi everyone, NVIDIA’s @ngimel has investigated this problem, and the hangs might not be related to pytorch. She has written a detailed comment here on figuring out the issue and working around it:

Please have a look and see if it applies to you.
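For reference, problems of this kind often come down to broken GPU peer-to-peer (p2p) communication. A quick way to probe it from pytorch (a sketch; the exact diagnosis in the linked comment may differ):

import torch

# Check whether each GPU reports peer access to the other; False here on
# a multi-GPU box is a hint that p2p, which NCCL relies on, is broken.
print(torch.cuda.can_device_access_peer(0, 1))
print(torch.cuda.can_device_access_peer(1, 0))

If p2p turns out to be broken at the system level, NCCL’s p2p path can be disabled by setting the NCCL_P2P_DISABLE=1 environment variable before launching the script.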