Hi,
I am using CUDA for a simple model in the MNIST example with
model = torch.nn.DataParallel(model, device_ids=[0, 1]).cuda()
The program then hangs with 100% GPU-Util on both GPUs under nvidia-smi, although it runs fine with one GPU.
It seems similar to a previous posting. However, I am using a 1080 Ti, which seems to work fine for other users.
I reverted to CUDA 8.0 and installed PyTorch from source, but the problem persisted.
With a single GPU the CUDA code runs fine; with two GPUs it hangs. The stack trace was posted above. Please help, I am stuck here.
1080 Ti cards. nvidia-smi shows 100% GPU-util on both cards.
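Since the same script runs fine on one GPU, one way to confirm that without editing the model code is to hide the second card before CUDA initializes. A minimal sketch (the tensor shapes here are arbitrary):

import os
# Hide all but GPU 0 from this process. This must happen before torch
# initializes CUDA, i.e. before the first .cuda() call.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

print(torch.cuda.device_count())  # prints 1: only one device is visible
x = torch.rand(5, 5).cuda()       # lands on the only visible device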
I also use two 1080 Ti cards, and as soon as I use nn.DataParallel some memory gets allocated and the GPU load jumps to 100%, but then it gets stuck:
import torch
from torch import nn
from torch.autograd import Variable

l = nn.Linear(5, 5).cuda()   # the plain module on one GPU works fine
pl = nn.DataParallel(l)      # wrap it to replicate across both GPUs
print("Checkpoint 1")
a = Variable(torch.rand(5, 5).cuda(), requires_grad=True)
print("Checkpoint 2")
print(pl(a))                 # here it gets stuck
print("Checkpoint 3")
Okay, I am using an AMD Ryzen 1700, and my chip is most probably affected by the segfault bug (https://community.amd.com/thread/215773). I just wanted to make sure this is not some really rare coincidence. Threadripper, however, is not affected.
Hi everyone, NVIDIA's @ngimel has investigated this problem, and the hangs might not be related to PyTorch. She has written a detailed comment here on diagnosing the issue and working around it: