I have a problem training a model on my server with 8 GPUs. The model trains without problems on 4 GPUs. However, when I try with 5 or more GPUs, the machine reboots without printing any errors or warnings. The memory, wattage, temperature, and GPU-Util of the individual GPUs (monitored via nvidia-smi) look normal at the moment of the freeze/reboot. The same goes for the system's memory and CPU usage.
The code is the following:
```python
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import numpy as np
from torch.autograd import Variable
import os


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        param = (length,) + size
        print(param)
        self.data = torch.randn(*param)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class Model(nn.Module):
    def __init__(self, inp=1, out=64):
        super(Model, self).__init__()
        self.conv3d = nn.Conv3d(inp, out, kernel_size=3, padding=(1, 1, 1))

    def forward(self, inp):
        output = self.conv3d(inp)
        print("\tIn Model: input size", inp.size(), "output size", output.size())
        return output


class C3D(nn.Module):
    def __init__(self):
        super(C3D, self).__init__()
        J = 32
        K = 64
        L = 128
        self.group0 = nn.Sequential(
            nn.Conv3d(1, J, kernel_size=3, padding=(1, 1, 1)),
            nn.ReLU()
        )
        self.group1 = nn.Sequential(
            nn.Conv3d(J, K, kernel_size=3, padding=(1, 1, 1)),
            nn.MaxPool3d(2, stride=2, ceil_mode=True),
            nn.ReLU()
        )
        self.group2 = nn.Sequential(
            nn.Conv3d(K, L, kernel_size=4, padding=(1, 1, 1)),
            nn.MaxPool3d(2, stride=2, ceil_mode=True),
            nn.ReLU()
        )
        self._features = nn.Sequential(
            self.group0,
            self.group1,
            self.group2,
        )

    def forward(self, x):
        out = self._features(x)
        return out

    def weights_init(self, m):
        classname = m.__class__.__name__
        if classname.find('Conv') != -1:
            m.weight.data.normal_(0.0, 0.02)
            m.bias.data.normal_(0.0, 0.02)


# RUN MODEL
# Parameters
GPUs = [0, 1, 2, 3, 4, 5, 6, 7]
input_size = (96, 96, 96)
output_size = 2
batch_size = 30
data_size = 120

# Generate data
rand_loader = DataLoader(dataset=RandomDataset(input_size, data_size),
                         batch_size=batch_size, shuffle=True)

# Setup model on GPUs
os.environ["CUDA_VISIBLE_DEVICES"] = ','.join(str(e) for e in GPUs)
model = C3D()
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    model = nn.DataParallel(model)
model.cuda()

# Run model
for data in rand_loader:
    data = Variable(torch.Tensor(np.expand_dims(data, axis=4)))
    data = data.permute(0, 4, 1, 2, 3)
    data = data.cuda()  # .cuda() is not in-place; the result must be assigned
    output = model(data)
    print("Outside: input size", data.size(), "output_size", output.size())
```
The tests I have done so far are the following:
- I ran the code with 4 GPUs without any problems. I then ran the same code above with 5 or more GPUs and it caused my machine to reboot without printing any warnings or errors. Before rebooting, the metrics that I monitor with nvidia-smi and htop don’t show anything abnormal.
- I then tried to run a smaller version of the model by removing the third layer, i.e. self.group2. With 8 GPUs it caused a reboot. With 6 GPUs it also caused a reboot. However, it worked with 5 or fewer GPUs. Some of the runs with 6 or more GPUs managed to print a few inferences before crashing the machine. The more GPUs I used, the fewer inferences were printed before the crash.
- I also tried to launch 2 runs in parallel of the same model on the same machine using 4 GPUs for the first run and the remaining 4 GPUs for the second run. After launching the second run, the machine rebooted.
- To exclude the hypothesis that a GPU was malfunctioning, I first ran the model on GPUs 0, 1, 2, 3. It worked. I then ran the model on GPUs 1, 2, 3, 4. It also worked. However, when the model was run on GPUs 0, 1, 2, 3, 4, it caused the machine to reboot.
- I also ran a 2D version of the same model, with an input of size (96,96) and Conv2d/MaxPool2d layers, to test whether the issue was specific to the 3D operations. The problem also occurs with this 2D version of the data and model.
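To make the per-GPU elimination test above more systematic, one option is to sweep every n-GPU subset automatically by restricting device visibility via CUDA_VISIBLE_DEVICES. A minimal sketch (the script name train.py is a placeholder for whatever entry point runs your training code):

```python
import itertools
import os
import subprocess


def subsets(n, gpus=range(8)):
    """All n-element GPU index combinations to try."""
    return list(itertools.combinations(gpus, n))


def sweep(n, script="train.py"):
    """Run the training script once per n-GPU subset, restricting
    device visibility through CUDA_VISIBLE_DEVICES."""
    for subset in subsets(n):
        env = dict(os.environ,
                   CUDA_VISIBLE_DEVICES=",".join(map(str, subset)))
        print("testing GPUs", subset)
        subprocess.run(["python", script], env=env, check=False)
```

If the machine reboots for every 5-GPU subset but never for any 4-GPU subset, that points away from a single faulty card and toward an aggregate limit (power, PCIe, or driver) rather than a specific device.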
It seems like the reboot/crash depends mainly on the number of GPUs used, as well as on the model size and the number of forward passes completed.
Regarding the environment: PyTorch 1.3.0, NVIDIA-SMI 418.67, Driver Version 418.67, CUDA Version 10.1.
Regarding the hardware, I am using 8x Nvidia Titan RTX GPUs with 24 GB of memory each, mounted in a Supermicro SuperServer 7049GP-TRT motherboard. The machine has 252 GB of RAM.
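Since the crash threshold sits between 4 and 5 GPUs, a back-of-envelope power check may be relevant. This assumes the published 280 W TDP of the Titan RTX; transient draw can spike above TDP, so these are lower bounds to compare against the server's PSU rating:

```python
TDP_W = 280  # published Titan RTX TDP; real transients can exceed this

# estimated GPU-only draw at full load, before CPUs, RAM, fans, and disks
for n in (4, 5, 6, 7, 8):
    print(f"{n} GPUs: ~{n * TDP_W} W")
```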
The hypotheses I have so far are the following:
- hardware related: I don't exclude that the problem might be power or memory related, although the metrics from nvidia-smi and htop don't support this hypothesis.
- software related: something goes wrong in the PyTorch module, in particular when running a model on more than 4 GPUs. Perhaps a memory leak on the GPUs?
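One way to test the power hypothesis despite nvidia-smi looking normal is to log power draw at a 1 s interval to a file that survives the reboot, since glancing at nvidia-smi can easily miss a short spike. A sketch, where the log path and the parsing helper are my own additions, not part of the original code:

```python
import subprocess
import time

# query string uses nvidia-smi's documented --query-gpu CSV output
QUERY = ["nvidia-smi",
         "--query-gpu=timestamp,index,power.draw,temperature.gpu",
         "--format=csv,noheader"]


def log_power(path="power_log.csv", interval=1.0):
    """Append one sample per GPU per interval; flush immediately so the
    last readings before a hard reboot are preserved on disk."""
    with open(path, "a") as f:
        while True:
            f.write(subprocess.run(QUERY, capture_output=True,
                                   text=True).stdout)
            f.flush()
            time.sleep(interval)


def peak_power(lines):
    """Peak recorded draw per GPU, in watts, from the logged CSV lines."""
    peaks = {}
    for line in lines:
        parts = [p.strip() for p in line.split(",")]
        if len(parts) < 3:
            continue
        idx, watts = parts[1], float(parts[2].rstrip(" W"))
        peaks[idx] = max(peaks.get(idx, 0.0), watts)
    return peaks
```

After the machine comes back up, peak_power(open("power_log.csv")) shows the highest draw each GPU reached before the crash; checking dmesg and the BMC/IPMI event log for the previous boot is also worth a look, since an over-current shutdown often leaves a trace there rather than in the OS logs.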
Please let me know what you think and if you have any suggestions. Thanks!