Segmentation fault (core dumped) when running with >2 GPUs

I’m getting a segmentation fault when trying to run my code on an instance with 4 Tesla K80s. If I use any 2 GPUs, the model trains just fine, but as soon as I increase the number of GPUs to 3, I get Segmentation fault (core dumped). I’ve tried running under gdb, and here’s what I see:

Thread 49 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffefaffb700 (LWP 16808)]
0x00007ffff062a8d5 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
(gdb) where
#0  0x00007ffff062a8d5 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#1  0x00007ffff077a914 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2  0x00007ffff0716e80 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3  0x00007ffff77e46ba in start_thread (arg=0x7ffefaffb700) at pthread_create.c:333
#4  0x00007ffff6e0a41d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
(gdb) 

I’m running inside a conda environment with PyTorch 0.3.1 on Python 2 and CUDA 8.0. I’m also using OpenCV.
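
For reference, a quick sanity check along these lines (just standard torch.cuda calls, nothing specific to my setup) shows what PyTorch itself reports about the environment:

import torch

# Report the environment as PyTorch sees it; handy when chasing driver/CUDA issues.
print(torch.__version__)           # e.g. 0.3.1
print(torch.cuda.is_available())   # whether the CUDA runtime initialized at all
print(torch.cuda.device_count())   # should be 4 on the K80 instance
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))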

I can actually reproduce this issue with this simple script:

import torch
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 53 * 53, 120)  # 53 x 53 is the spatial size after two conv+pool stages on 224 x 224 input
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 53 * 53)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

if __name__ == "__main__":
    net = Net()
    net.cuda()
    net = nn.DataParallel(net)
    X = torch.rand(2, 3, 224, 224).cuda()
    X = Variable(X)
    y = net(X)

If I comment out net = nn.DataParallel(net), it only runs when I specify CUDA_VISIBLE_DEVICES=3; any other device selection produces a segmentation fault with the same gdb backtrace as above.
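
For what it’s worth, this is roughly how I pin the process to a single GPU while testing (a sketch; the index 3 is just the device that happens to work for me, and repro.py stands in for the script above):

import os

# CUDA_VISIBLE_DEVICES has to be set before CUDA is initialized, so either
# export it in the shell (CUDA_VISIBLE_DEVICES=3 python repro.py) or set it
# at the very top of the script, before importing torch.
os.environ["CUDA_VISIBLE_DEVICES"] = "3"

import torch
import torch.nn as nn

print(torch.cuda.device_count())  # now 1: physical device 3 is visible as device 0

# When several GPUs are left visible, DataParallel can also be restricted
# explicitly instead of relying on the environment variable:
# net = nn.DataParallel(net, device_ids=[0, 1])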

Seems I just had to reinstall my NVIDIA drivers.

I encountered the same issue on both of my VMs (Ubuntu 14.04, driver 390.42, CUDA 8.0, Python 3.6 and Ubuntu 16.04, driver 390.42, CUDA 9.0, Python 2.7). Reinstalling the driver fixes it, but the problem comes back after a short while.

(gdb) r -c 'import torch; torch.zeros((3,3,3)).cuda(1)'
Starting program: /storage/litong/anaconda3/bin/python -c 'import torch; torch.zeros((3,3,3)).cuda(1)'
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffbd8dc700 (LWP 56214)]
[New Thread 0x7fffbd0db700 (LWP 56215)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffbd0db700 (LWP 56215)]
0x00007ffff0fcf805 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
(gdb) where
#0 0x00007ffff0fcf805 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#1 0x00007ffff111fc34 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#2 0x00007ffff10bc180 in ?? () from /usr/lib/x86_64-linux-gnu/libcuda.so.1
#3 0x00007ffff7bc4184 in start_thread (arg=0x7fffbd0db700)
at pthread_create.c:312
#4 0x00007ffff78f103d in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111