I’m getting a segmentation fault when trying to run my code on an instance with 4 Tesla K80s. If I use any 2 GPUs, the model trains just fine. But when I increase the number of GPUs to 3, I get Segmentation Fault (core dumped). I’ve tried running gdb and here’s what I see:
Thread 49 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffefaffb700 (LWP 16808)]
0x00007ffff062a8d5 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
(gdb) where
#0 0x00007ffff062a8d5 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#1 0x00007ffff077a914 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2 0x00007ffff0716e80 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007ffff77e46ba in start_thread (arg=0x7ffefaffb700) at pthread_create.c:333
#4 0x00007ffff6e0a41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
(gdb)
I’m running in a conda environment with PyTorch 0.3.1, Python 2, and CUDA 8.0. I’m also using OpenCV.
I can actually reproduce this issue with this simple script:
import torch
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 53 * 53, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 53 * 53)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


if __name__ == "__main__":
    net = Net()
    net.cuda()
    net = nn.DataParallel(net)
    X = torch.rand(2, 3, 224, 224).cuda()
    X = Variable(X)
    y = net.forward(X)
If I comment out the net = nn.DataParallel(net) line, the script only runs when I specify CUDA_VISIBLE_DEVICES=3. Anything else produces a segmentation fault with the same gdb output as above.
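
To narrow down which devices are involved, one option is to sweep over GPU subsets, running the script in a separate child process for each subset so that a crash does not take down the sweep itself. The following is only a minimal sketch under stated assumptions: it assumes a POSIX system (where a child killed by a signal reports a negative return code), and the helper names run_with_devices, describe, and sweep are made up for illustration. It uses only the Python standard library.

```python
import itertools
import os
import signal
import subprocess
import sys


def run_with_devices(cmd, devices):
    """Run cmd in a child process that only sees the given GPU indices."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(d) for d in devices)
    proc = subprocess.Popen(cmd, env=env)
    proc.wait()
    return proc.returncode


def describe(returncode):
    """Translate a child return code into a short status string.

    On POSIX, a negative return code means the child was killed by
    the signal with that number.
    """
    if returncode == 0:
        return "ok"
    if returncode == -signal.SIGSEGV:
        return "segfault"
    return "exit code %d" % returncode


def sweep(cmd, num_gpus, group_size):
    """Run cmd once for every subset of group_size GPUs out of num_gpus."""
    results = {}
    for devices in itertools.combinations(range(num_gpus), group_size):
        results[devices] = describe(run_with_devices(cmd, devices))
    return results
```

For example, if the script above were saved as repro.py (a hypothetical filename), calling sweep([sys.executable, "repro.py"], 4, 1) would run it once per single GPU, and a group_size of 2 or 3 would cover the multi-GPU combinations. The returned dictionary maps each tuple of device indices to "ok", "segfault", or an exit code.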