nn.DataParallel(model).cuda() stuck

Hi,
I am using cuda for a simple model in the mnist example with
model=torch.nn.DataParallel(model, device_ids=[0,1]).cuda()
The program then hangs with 100% GPU-Util on GPUs 1 and 2 under nvidia-smi, although it runs fine with a single GPU.
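Roughly the pattern in question (a minimal sketch only; the real model and training loop are in the linked raw code, and the small stand-in model here is just for illustration):

import torch
import torch.nn as nn
from torch.autograd import Variable

# Stand-in for the MNIST example's Net (illustrative only).
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
model = torch.nn.DataParallel(model, device_ids=[0, 1]).cuda()

# Minimal forward pass: DataParallel scatters the batch across GPUs 0 and 1
# and gathers the outputs back on the first device.
data = Variable(torch.rand(64, 784).cuda())
output = model(data)  # with two GPUs, this is where the hang shows up
print(output.size())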

It seems similar to a previous posting. However, I am using 1080ti, which seems to work fine for other users.

Raw code and stack trace are here:

__Python VERSION: 3.6.2 |Anaconda custom (64-bit)| (default, Sep 22 2017, 02:03:08)
[GCC 7.2.0]
__PyTorch VERSION: 0.2.0_4
__CUDNN VERSION: 6021
__Number CUDA Devices: 4

Can somebody please reply? I need help here.
It runs well on one GPU, but not on multiple GPUs.

In addition, I was able to use all GPUs with TensorFlow; the problem only occurs with PyTorch.

Hey @SamTse, can you tell us which CUDA version you are on?

Also, by any chance, can you check whether this is also a problem with a source install:

Thanks for getting back to me.

We installed cuda 9.0 with:
cuda-repo-ubuntu1604-9-0-local_9.0.176-1_amd64.deb

pytorch and cuda80 were installed with conda.

Why are you mixing CUDA versions? Could it be that something is dynamically linked against cudart 9.0 instead of 8.0?
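One rough way to check (a sketch, assuming a Linux system with ldd on the PATH; the exact set of shared libraries under torch/lib depends on how PyTorch was installed):

import glob
import os
import subprocess

import torch

# Print which CUDA runtime each torch shared library is dynamically linked
# against, to spot a cudart 9.0 being picked up instead of 8.0.
lib_dir = os.path.join(os.path.dirname(torch.__file__), "lib")
for so in glob.glob(os.path.join(lib_dir, "*.so*")):
    try:
        out = subprocess.check_output(["ldd", so]).decode()
    except subprocess.CalledProcessError:
        continue
    for line in out.splitlines():
        if "cudart" in line:
            print(os.path.basename(so), "->", line.strip())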

I reverted to CUDA 8.0 and installed PyTorch from source, but I still have the same problem.
A single GPU runs CUDA fine, but with two GPUs the code hangs. The stack trace was posted above. Please help, I am stuck here.

These are 1080 Ti cards. nvidia-smi shows 100% GPU-Util on both cards.

Hey,
I built PyTorch from source at the current master (https://github.com/pytorch/pytorch/commit/cc3058bdac925dc20ba18b5829f689a67227f753) with “Cuda compilation tools, release 8.0, V8.0.61”. (I also tried PyTorch v0.2 from the conda binaries.)

I also use two 1080 Ti cards, and as soon as I use nn.DataParallel some memory gets allocated and the GPU load jumps to 100%, but then it gets stuck:

from torch import nn
from torch.autograd import Variable
import torch

l = nn.Linear(5,5).cuda()
pl = nn.DataParallel(l)
print("Checkpoint 1")
a = Variable(torch.rand(5,5).cuda(), requires_grad=True)
print("Checkpoint 2")
print(pl(a)) # Here it gets stuck
print("Checkpoint 3")

I reproduced the same problem as tjoseph. Stack trace is here:

I have exactly the same issue here. Any solution?

Maybe you can help here. https://github.com/pytorch/pytorch/issues/1637 or just upvote :slight_smile:

One more question: What CPU do you guys have?

I am using an AMD Ryzen Threadripper 1950X 16-Core Processor.

Okay, I am using an AMD Ryzen 1700, and my CPU is most probably affected by the segfault bug (https://community.amd.com/thread/215773). I just wanted to rule out a really rare coincidence. But the Threadripper is not affected.

Hi everyone, NVIDIA’s @ngimel has investigated this problem, and the hangs might not be related to PyTorch. She has written a detailed comment here on figuring out the issue and working around it:

Please have a look and see if it applies to you.
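If you want a quick way to see whether the hang is in GPU-to-GPU communication itself rather than in DataParallel, a minimal cross-device copy like this sketch (illustrative only, not taken from @ngimel’s comment) should either finish immediately or hang in the same way:

import torch

# Copy a tensor from GPU 0 to GPU 1 and force a synchronization. If
# device-to-device transfers are broken at the system level, this tends to
# hang just like the DataParallel forward pass.
x = torch.rand(1000, 1000).cuda(0)
y = x.cuda(1)  # device-to-device copy
torch.cuda.synchronize()
print(y.sum())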

Thank you. It worked fine for me after following @ngimel’s suggestion.