Training crashes machine when using more than 4 GPUs

Hi,

I have a problem training a model on my server with 8 GPUs. The model trains without problems on 4 GPUs, but when I try with 5 or more GPUs the machine reboots without printing any errors or warnings. The memory usage, wattage, temperature, and GPU utilization of the individual GPUs (monitored via nvidia-smi) look normal at the moment of the freeze/reboot. The same goes for the system’s memory and CPU usage.

The code is the following:

import os

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader


class RandomDataset(Dataset):

    def __init__(self, size, length):
        self.len = length
        param = (length,) + size
        print(param)
        self.data = torch.randn(*param)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

class Model(nn.Module):

    def __init__(self, inp=1, out=64):
        super(Model, self).__init__()
        self.conv3d = nn.Conv3d(inp,out,kernel_size=3,padding=(1,1,1))

    def forward(self, inp):
        output = self.conv3d(inp)
        print("\tIn Model: input size", inp.size(), "output size", output.size())
        return output

class C3D(nn.Module):

    def __init__(self):

        super(C3D,self).__init__()
        
        J = 32      
        K = 64      
        L = 128

        self.group0 = nn.Sequential(
            nn.Conv3d(1,J,kernel_size=3,padding=(1,1,1)),
            nn.ReLU()
        )

        self.group1 = nn.Sequential(
            nn.Conv3d(J,K,kernel_size=3,padding=(1,1,1)),
            nn.MaxPool3d(2, stride=2, ceil_mode=True),
            nn.ReLU()
        )

        self.group2 = nn.Sequential(
            nn.Conv3d(K, L, kernel_size=4, padding=(1,1,1)),
            nn.MaxPool3d(2, stride=2,ceil_mode=True),
            nn.ReLU()
        )

        self._features = nn.Sequential(
            self.group0,
            self.group1,
            self.group2,
        )

    def forward(self, x):
        out = self._features(x)
        return out

    def weights_init(self,m):
        classname = m.__class__.__name__
        if classname.find('Conv') != -1:
            m.weight.data.normal_(0.0, 0.02)
            m.bias.data.normal_(0.0, 0.02)

# RUN MODEL

# Parameters

GPUs = [0, 1, 2, 3, 4, 5, 6, 7]

input_size = (96,96,96)
output_size = 2

batch_size = 30
data_size = 120

# Generate data 

rand_loader = DataLoader(dataset=RandomDataset(input_size, data_size),
                         batch_size=batch_size, shuffle=True)

# Setup model on GPUs

# CUDA_VISIBLE_DEVICES has to be set before the first CUDA call (torch.cuda.device_count() below)
os.environ["CUDA_VISIBLE_DEVICES"] = ','.join(str(e) for e in GPUs)

model = C3D()
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    model = nn.DataParallel(model)
    model.cuda()

# Run model

for data in rand_loader:
    data = data.unsqueeze(1)  # add the channel dimension: (N, 1, 96, 96, 96)
    data = data.cuda()        # .cuda() is not in-place; the result must be assigned back
    output = model(data)
    print("Outside: input size", data.size(), "output size", output.size())

The tests I have done so far are the following:

  • I ran the code with 4 GPUs without any problems. I then ran the same code above with 5 or more GPUs and it caused my machine to reboot without printing any warnings or errors. Before rebooting, the metrics I monitor with nvidia-smi and htop did not show anything abnormal.
  • I then tried a smaller version of the model above by removing the third layer, i.e. the self.group2 block. With 8 GPUs it caused a reboot, and with 6 GPUs it also caused a reboot, but it worked with 5 or fewer GPUs. Some of the runs with 6 or more GPUs managed to print a few inferences before crashing the machine; the higher the number of GPUs, the fewer inferences it managed to print before crashing.
  • I also tried to launch two runs of the same model in parallel on the same machine, using 4 GPUs for the first run and the remaining 4 GPUs for the second run. After launching the second run, the machine rebooted.
  • To exclude the hypothesis that a single GPU was malfunctioning, I first ran the model on GPUs 0, 1, 2, 3, which worked. I then ran it on GPUs 1, 2, 3, 4, which also worked. However, running the model on GPUs 0, 1, 2, 3, 4 caused the machine to reboot (a sketch for automating this kind of subset test follows this list).
  • I also ran the same model with a (96, 96) input and Conv2d/MaxPool2d layers, to test whether the issue was specific to the 3D operations. The problem also occurs with this 2D version of the data and model.
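
For completeness, a driver along the lines of the sketch below can automate this kind of subset test by launching the reproduction script once per GPU subset with a different CUDA_VISIBLE_DEVICES setting. The script name repro.py is a placeholder for the code above saved to a file, with its own hard-coded CUDA_VISIBLE_DEVICES assignment removed so that the external setting takes effect.

import os
import subprocess

# GPU subsets to test; the last one is a 5-GPU run that triggers the reboot here.
subsets = [(0, 1, 2, 3), (4, 5, 6, 7), (1, 2, 3, 4), (0, 1, 2, 3, 4)]

for subset in subsets:
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in subset)
    print("Testing GPUs", env["CUDA_VISIBLE_DEVICES"])
    # repro.py is a placeholder for the code above saved to a file.
    result = subprocess.run(["python", "repro.py"], env=env)
    print("Exit code:", result.returncode)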

It seems like the reboot/crash depends mainly on the number of GPUs used, as well as on the model size and the number of forward passes completed.

Regarding the environment, I am using PyTorch 1.3.0, NVIDIA-SMI 418.67, driver version 418.67, and CUDA version 10.1.

Regarding the hardware, I am using 8x NVIDIA Titan RTX GPUs with 24 GB of memory each, mounted in a Supermicro SuperServer 7049GP-TRT. The machine has 252 GB of RAM.

The hypotheses I have so far are the following:

  • Hardware related: I don’t exclude that the problem might be power or memory related, although the metrics from nvidia-smi and htop don’t support this hypothesis (a small logging sketch to check for power spikes follows this list).
  • Software related: something goes wrong in PyTorch, in particular when running a model on more than 4 GPUs. Maybe a memory leak on the GPUs?
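
To check the power hypothesis more rigorously than by watching nvidia-smi interactively, a logger along the lines of the sketch below could record per-GPU power draw and temperature to a CSV file once per second, run in a separate shell while training. The query fields are standard nvidia-smi options; the log file name is arbitrary.

import subprocess

# Log per-GPU power draw, temperature, utilization and memory once per second.
# Run this alongside the training job and stop it with Ctrl-C. On a hard reboot
# the last line or two may be lost in the OS cache, but the trend survives.
with open("gpu_power_log.csv", "w") as log:
    subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=timestamp,index,power.draw,temperature.gpu,utilization.gpu,memory.used",
            "--format=csv",
            "-l", "1",
        ],
        stdout=log,
    )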

Please let me know what you think and whether you have any suggestions. Thanks!

Matthias

Based on the description, I would suspect that a faulty or weak PSU is causing the reboot.
However, could you run the code with 4 different GPUs at a time and make sure that it works with all devices?

Thanks, it was indeed a PSU problem! Two of the four power cables of the server were not plugged in properly. That was enough to power the machine and run trivial jobs, but not enough to train my model. Thanks for your help, Matthias
