Separate scripts on separate GPUs cause a system freeze

Hey there,

so I’ve been using PyTorch (0.3.1) for a while and have been trying to fix the following issue for some time now:

When I run a training script simultaneously (i.e. in two separate terminal instances) with different parameters but the same overall structure on two separate 1080 Tis, I get a full system freeze under Ubuntu 16.04 that I can’t recover from without a forced shutdown. Also, there is no prior error message.
This doesn’t happen when I utilize either one of the GPUs alone.

Since I have encountered this issue, I have tried multiple things to fix this:
[1] Upgraded PSU to 1200W
[2] Set shared memory to high values.
[3] Used torch.multiprocessing.set_start_method('spawn')
[4] Varied the number of workers. Except for 0 workers, the error still occurs. Unfortunately, the 0-worker option is not feasible for me. This most likely points to a multiprocessing issue, but I don’t know enough in that regard to investigate deeper. Also, it’s hard to tell since, again, no error message is thrown.
[5] Tested Ubuntu 18.04 - Same issue
[6] Tested PyTorch 0.4 - Same issue
[7] Worked my way through most of the NVIDIA driver versions.

Since this is most likely (although not necessarily) an issue specific to my PC or my training script, I don’t expect full solutions. But pointers to options that could be helpful and that I haven’t looked into yet would be nice :).

Note:
The training script structure is fairly straightforward: initialize a DataLoader with paths to image files -> iterate over them one by one.

Could you try to run your script via CUDA_VISIBLE_DEVICES=YOUR_GPU_ID python script.py?
This will make sure the current script only sees the specified GPU.
Let me know if that helps.

I’ve been doing that internally via os.environ["CUDA_VISIBLE_DEVICES"] = <device_num> in the script itself. Does it make a difference to declare it when calling the script instead?
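For reference, the in-script variant looks roughly like this (a minimal sketch); the important part is that the variable is set as early as possible, before torch or anything else that initializes CUDA is imported:

import os

# Restrict this process to a single physical GPU; "0" is just an example id.
# This has to happen before CUDA is initialized, so it goes before the torch import.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

# The selected GPU now shows up inside this process as device 0.
print(torch.cuda.device_count())   # expected: 1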

Nonetheless, I’ll still give it a try :slight_smile:

So unfortunately, as expected, choosing the device outside of the actual Python script did not change anything; the system still crashed :/.

Any other suggestions? :slight_smile:

I think we should narrow down the error.
Could you try to run the scripts without the GPUs?
If that still hangs, could you use num_workers=0 in your DataLoaders and run it again?

I actually haven’t tried running without a GPU yet! I’ll see what that gives me.

Side note: I’ve already tested num_workers=0 (with GPU) and it works.

Are you working on Windows?
Edit: just saw you wrote Ubuntu.
Have you tried wrapping your script in if __name__ == "__main__":?

Thanks for the suggestion!
Fortunately (or unfortunately) I’ve already tried putting my main function and all relevant settings into if __name__ == "__main__", but it didn’t seem to help.

Could you check your shared memory limit? This GitHub issue probably deals with the same problem.

And did you try adding freeze_support like this?

from multiprocessing import freeze_support
...
if __name__ == "__main__":
    freeze_support()
    ...

EDIT: What happens if you combine both scripts into a single script and parallelize it by hand (spawning some processes yourself)?
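Something along these lines is what I have in mind (just a sketch; main() and the parameter dicts stand in for your actual entry point and settings):

import torch
import torch.multiprocessing as mp

def run(gpu_id, params):
    # Each child process pins itself to one GPU and then runs the usual training code.
    torch.cuda.set_device(gpu_id)
    main(params)   # hypothetical: whatever your script's main()/training loop is

if __name__ == "__main__":
    mp.set_start_method('spawn')
    jobs = [(0, {'lr': 1e-4}), (1, {'lr': 1e-3})]   # placeholder parameter sets
    procs = [mp.Process(target=run, args=job) for job in jobs]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

That way both runs share one parent process, and you can at least see which child dies first.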

@ptrblck: I’ve checked that issue and tested everything listed there, but increasing the shared memory did not solve this issue.

@justusschock: So as it turns out, when I put my complete script into a main() function (which I thought I had already tried, but apparently not :)), I no longer get a full system freeze! However, the scripts still stop running, with one of these errors being thrown:

  1. RuntimeError: cuda runtime error (4) : unspecified launch failure at /opt/conda/conda-bld/pytorch_1518243271935/work/torch/lib/THC/generated/../generic/THCTensorMathPointwise.cu:630 during optimizer.step().
  2. DataLoader worker (pid 15189) is killed by signal: Segmentation fault or Illegal instruction, which occurs seemingly at random, likewise during optimizer.step().

Full errors are appended.

Any ideas what could cause this/what generally leads to errors like this?

FULL ERRORS

EXAMPLE 1

ERROR: Unexpected segmentation fault encountered in worker
Traceback (most recent call last):
  File "script.py", line 590, in <module>
    main()
  File "script.py", line 540, in main
    trainer(epoch)
  File "script.py", line 395, in trainer
    optimizer.step()
  File "/home/user/software/miniconda3/envs/D3L/lib/python3.6/site-packages/torch/optim/adam.py", line 72, in step
    denom = exp_avg_sq.sqrt().add_(group['eps'])
  File "/home/user/software/miniconda3/envs/D3L/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 175, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 9437) is killed by signal: Segmentation fault.

EXAMPLE 2

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1518243271935/work/torch/lib/THC/generic/THCTensorMathPairwise.cu line=81 error=4 : unspecified launch failure                                   | 418/2092 [02:38<10:34,  2.64it/s]
Traceback (most recent call last):
  File "script.py", line 590, in <module>
    main()
  File "script.py", line 540, in main
    trainer(epoch)
  File "script.py", line 395, in trainer
    optimizer.step()
  File "/home/user/software/miniconda3/envs/D3L/lib/python3.6/site-packages/torch/optim/adam.py", line 69, in step
    exp_avg.mul_(beta1).add_(1 - beta1, grad)
RuntimeError: cuda runtime error (4) : unspecified launch failure at /opt/conda/conda-bld/pytorch_1518243271935/work/torch/lib/THC/generic/THCTensorMathPairwise.cu:81

EXAMPLE 3

Traceback (most recent call last):
  File "script.py", line 590, in <module>
    main()
  File "script.py", line 540, in main
    trainer(epoch)
  File "script.py", line 395, in trainer
    optimizer.step()
  File "/home/user/software/miniconda3/envs/D3L/lib/python3.6/site-packages/torch/optim/adam.py", line 72, in step
    denom = exp_avg_sq.sqrt().add_(group['eps'])
  File "/home/user/software/miniconda3/envs/D3L/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 175, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 6114) is killed by signal: Illegal instruction.

It seems to be an issue with how you update the parameters. Could you post a small code snippet?

And could you try executing the same script on CPU?
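One low-effort way to do that without maintaining two copies of the script is to gate the .cuda() calls behind a flag, for example (just a sketch; the FORCE_CPU variable is made up):

import os
import torch

# Set FORCE_CPU=1 in the environment to run the identical script without CUDA.
use_gpu = torch.cuda.is_available() and not os.environ.get('FORCE_CPU')

def to_dev(x):
    # Move a tensor / Variable / module to the GPU only when use_gpu is set.
    return x.cuda() if use_gpu else x

Then every .cuda() call in the training loop becomes to_dev(...), so the CPU and GPU runs go through exactly the same code path.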

OK, so this reply took a while, since running two full networks in parallel on CPU only takes quite some time.

Nevertheless, doing so shows the following:
When running the same script in parallel (see the code below) in CPU-only mode (so simply discarding any .cuda() calls, for 4 days straight & 12 epochs in total), there is NO freezing/hang-up. When using GPUs, I get a freeze after roughly 1.5 hours or 10 epochs.

This is weird, since the script stays the same in both cases (with the exception of whether or not a GPU is used).


Below you can see the relevant functions I use, invoked in sequence:

General Setup

UNet        = network_library.UNet(**Network_Parameters)
UNet.weight_init('he_normal')

Base_Loss_t = auxiliaries.Loss_Provider('dice_loss')
optimizer   = torch.optim.Adam(UNet.parameters(), lr=opt.lr, weight_decay=opt.l2_reg)
scheduler   = lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.2)

DataLoader-Setup

train_dataset       = Dataset_2D(path_to_data_files)
train_data_loader   = DataLoader(train_dataset, num_workers=6, batch_size=8, pin_memory=False, shuffle=True)

Note: I don’t think there is an error in the DataLoader, since it works fine using CPUs only.

Training Function

def training(epoch):
    _ = UNet.train()

    mini_dice1  = []
    mini_dice2  = []
    mini_loss   = []
    mini_time   = []

    train_data_iter = tqdm(train_data_loader, position=1)
    inp_string      = 'Epoch {} || Loss: --- | Dice: ---'.format(epoch)

    for slice_idx, full_file_dict in enumerate(train_data_iter):

        train_data_iter.set_description(inp_string)
        train_iter_start_time = time.time()

        training_slice  = full_file_dict["input_slice"]
        training_slice  = Variable(training_slice).type(torch.FloatTensor).cuda()

        #--- Run Training ---
        network_output = UNet(training_slice)

        ### BASE LOSS
        feed_dict = {'input':network_output}
        feed_dict['target'] = Variable(full_file_dict['ground_truth_mask']).cuda()

        loss       = Base_Loss_t(**feed_dict)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        mini_dice1.append(np.round(get_cuda(network_output)))
        mini_dice2.append(full_file_dict['ground_truth_mask'].numpy())

        mini_loss.append(loss.data.cpu().numpy()[0])
        mini_time.append(np.round(time.time()-train_iter_start_time,4))

I do hope I made some obvious mistake, but any help is appreciated!

Since the error might occur inside your DataLoader, it would be interesting to know what kind of data you read and maybe to see your dataset code.

I’ve skimmed through your train code and it looks fine to me.

I’ll post the dataset code later. Just a quick question: doesn’t the script running in CPU-only mode show that everything regarding the DataLoader should be fine? Or am I missing something?

To be honest: I don’t know.
Your error looks pretty strange, but since it might be related to your data, it can’t hurt to have a look at your data loading.

Do you set CUDA_VISIBLE_DEVICES before or after importing torch or modules that use torch?

Ok sure :).

Well, I’ve tried both: setting CUDA_VISIBLE_DEVICES=0 or 1 when calling the script via CUDA_VISIBLE_DEVICES=0 python script.py, and setting it after importing torch via os.environ.

Also, as a side note: for whatever reason, I don’t get an error raised anymore, but go straight to system freezes… it’s really weird :confused:

Well so I have tried:

  • [1] Moved everything to Windows - same problem there: no frozen system, but a bluescreen.
  • [2] Tested under PyTorch 0.4.1 - same problem.
  • [3] Set the integrated GPU as the display card and used the NVIDIA GPUs only for PyTorch - same problem.

So I can pretty much rule out OS problems. Also, I don’t think it is a hardware issue, since the PSU should be powerful enough.

So as a last hope, I’ve attached a functional version of the dataset. I hope there is some error in there :D.
In it, I accumulate every relevant path to my data files in a dictionary and, well, load them in __getitem__().

"""=================================="""
"""====== Load Basic Libraries ======"""
"""=================================="""
import numpy as np
import os, sys, time, csv, itertools, copy

from tqdm import tqdm, trange

import torch
import torch.nn as nn
from torch.utils.data import Dataset

sys.path.insert(0, '../Helper_Functions')
sys.path.insert(0, '../Network_Library')

import helper_functions as hf
import Network_Auxiliaries as aux

import scipy.ndimage.measurements as snm
import skimage.transform as st

import pandas as pd


""""""""""""""""""""""""""""""""""""
"""===== Dataset 4 Training ==="""
""""""""""""""""""""""""""""""""""""
class Dataset_Training(Dataset):
    def __init__(self, base_path, train_val_split=0.8, perc_data=1., standardize=True, augment=False, crop=[224,224], n_crops=1, channel_size=1, seed=1):  # n_crops/channel_size added with assumed defaults; they are assigned further down

        self.rng            = np.random.RandomState(seed)

        ### Read in CSVs containing paths to data slices (0.5MB numpy files .npy)
        v   = pd.read_csv(base_path+"Assign_2D_Volumes.csv",header=0)
        l   = pd.read_csv(base_path+"Assign_2D_LiverMasks.csv",header=0)
        n   = pd.read_csv(base_path+"Assign_2D_LesionMasks.csv",header=0)
        wle = pd.read_csv(base_path+"Assign_2D_LesionWmaps.csv",header=0)

        ### Get unique data volumes
        self.available_volumes = sorted(list(set(np.array(v["Volume"]))),key=lambda x: int(x.split('-')[-1]))
        self.rng.shuffle(self.available_volumes)
        self.available_volumes = self.available_volumes[:int(len(self.available_volumes)*perc_data)]
        self.available_volumes = self.available_volumes[:int(len(self.available_volumes)*train_val_split)]

        ### Create path dictionary
        roi_vicinity = 4
        self.training_volumes = {key:{"LesWmap_Paths":[], "LesMask_Paths":[],"LivMask_Paths":[],"Has Lesion":[], "Vol_Paths":[],"LesSlices":[], "LivSlices":[]} for key in self.available_volumes}

        iter_vals = tqdm(v["Volume"])
        iter_vals.set_description('Reading and assigning training data paths')

        for i,vol in enumerate(iter_vals):
            if np.sum(l["Has Mask"][i:i+roi_vicinity]) and vol in self.available_volumes:
                self.training_volumes[vol]["Vol_Paths"].append(v["Slice Path"][i])
                self.training_volumes[vol]["LivMask_Paths"].append(l["Slice Path"][i])
                self.training_volumes[vol]["LesMask_Paths"].append(n["Slice Path"][i])
                self.training_volumes[vol]["Has Lesion"].append(n["Has Mask"][i])
                self.training_volumes[vol]["LesWmap_Paths"].append(wle["Slice Path"][i])

        for i,vol in enumerate(self.available_volumes):
            if np.sum(self.training_volumes[vol]["Has Lesion"]):
                self.training_volumes[vol]["LesSlices"] = list(np.where(self.training_volumes[vol]["Has Lesion"])[0])
                self.training_volumes[vol]["LivSlices"] = list(set(np.arange(0,len(self.training_volumes[vol]["Has Lesion"])))-set(self.training_volumes[vol]["LesSlices"]))
            else:
                self.training_volumes[vol]["LesSlices"] = []
                self.training_volumes[vol]["LivSlices"] = list(np.arange(0,len(self.training_volumes[vol]["Has Lesion"])))

        self.n_files = np.sum([len(self.training_volumes[key]["Vol_Paths"]) for key in self.training_volumes.keys()])

        ### Other input arguments
        self.standardize    = standardize
        self.augment        = augment
        self.crop_size      = crop
        self.n_crops        = n_crops
        self.channel_size   = channel_size



    def __getitem__(self, idx):
        #>> Data Volume Array Of Interest
        VOI = self.available_volumes[self.rng.randint(0,len(self.available_volumes))]
        #>> There are two types of Slices - randomly pick one
        chosen_slice_from_VOI = self.rng.randint(0,3)>0 and len(self.training_volumes[VOI]["LesSlices"])>0

        #>> Pick a random slice from either category
        if chosen_slice_from_VOI:
            SOI = self.rng.choice(self.training_volumes[VOI]["LesSlices"])
        else:
            SOI = self.rng.choice(self.training_volumes[VOI]["LivSlices"])

        #>> Load Slice of Interest
        V2O = np.expand_dims(np.expand_dims(np.load(self.training_volumes[VOI]["Vol_Paths"][SOI]),0),0)
        #>> Perform data standardization if required
        V2O = hf.normalize(V2O, supply_mode="orig")

        #>> Load The Respective Target Mask
        Les2O = np.load(self.training_volumes[VOI]["LesMask_Paths"][SOI])
        Les2O = np.expand_dims(np.expand_dims(Les2O,0),0)

        #>> Load An Additional Crop Mask
        Liv2O = np.expand_dims(np.expand_dims(np.load(self.training_volumes[VOI]["LivMask_Paths"][SOI]),0),0)
        #>> And A Used Cost Map
        Wmap2O = np.expand_dims(np.expand_dims(np.load(self.training_volumes[VOI]["LesWmap_Paths"][SOI]),0),0)


        #>> Images Are Too Big So They Get Cropped.
        files_to_crop  = [V2O, Les2O, Wmap2O]

        #>> But First, I Augment Them (rotation & zooming with scipy.ndimage.interpolation.rotate/zoom)
        files_to_crop = list(hf.augment_2D(files_to_crop, copy_files=True, seed=self.rng.randint(0,1e8), is_mask = [0,1,0]))

        #>> Crop Images - Function slices random subarray from input arrays in files_to_crop. Liv2O provides the regions in which to crop.
        crops_for_picked_batch  = hf.get_crops_per_batch(files_to_crop, Liv2O, crop_size=self.crop_size)
        V2O     = crops_for_picked_batch[0]
        Les2O   = crops_for_picked_batch[1]
        Wmap2O  = crops_for_picked_batch[2]

        return_dict = {"vol":V2O[0,:], "lesmask":Les2O[0,:], "wmap": Wmap2O[0,:].astype('float')}
        return return_dict


    def __len__(self):
        return int(self.n_files)






""""""""""""""""""""""""""""""""""""
"""===== Dataset 4 Validation ==="""
""""""""""""""""""""""""""""""""""""
class Dataset_Validation(Dataset):
    def __init__(self, base_path, tv_split=0.85, standardize=True, seed=1, is_training=False, perc_data=1.):
        self.tv_split    = tv_split
        self.standardize = standardize
        self.rng = np.random.RandomState(seed)


        ### Read in CSVs with paths to data files (npy files, 0.5 MB)
        v   = pd.read_csv(base_path+"Assign_2D_Volumes.csv",header=0)
        l   = pd.read_csv(base_path+"Assign_2D_LiverMasks.csv",header=0)
        n   = pd.read_csv(base_path+"Assign_2D_LesionMasks.csv",header=0)

        ### Get unique volumes
        self.available_volumes = sorted(list(set(np.array(v["Volume"]))),key=lambda x: int(x.split('-')[-1]))
        self.rng.shuffle(self.available_volumes)
        self.available_volumes = self.available_volumes[:int(len(self.available_volumes)*perc_data)]
        self.available_volumes = self.available_volumes[int(len(self.available_volumes)*tv_split):]
        self.available_volumes.sort()

        self.validation_volumes = {key:{"LesMask_Paths":[],"LivMask_Paths":[],"Vol_Paths":[]} for key in self.available_volumes}

        iter_vals = tqdm(v["Volume"])
        iter_vals.set_description('Reading and assigning validation data paths')

        for i,vol in enumerate(iter_vals):
            if vol in self.available_volumes:
                if l['Has Mask'][i]:
                    self.validation_volumes[vol]["Vol_Paths"].append(v["Slice Path"][i])
                    self.validation_volumes[vol]["LivMask_Paths"].append(l["Slice Path"][i])
                    self.validation_volumes[vol]["LesMask_Paths"].append(n["Slice Path"][i])

        self.volume_separators = [len(self.validation_volumes[vol]["Vol_Paths"]) for vol in self.validation_volumes.keys()]
        self.validation_data   = {"vol":[], "lesmask":[], "livmask":[]}


        for vol in self.available_volumes:
            for i in range(len(self.validation_volumes[vol]["Vol_Paths"])):
                vol_slices = self.validation_volumes[vol]["Vol_Paths"][i]
                liv_slices = self.validation_volumes[vol]["LivMask_Paths"][i]
                les_slices = self.validation_volumes[vol]["LesMask_Paths"][i]

                self.validation_data["vol"].append(vol_slices)
                self.validation_data["livmask"].append(liv_slices)
                self.validation_data["lesmask"].append(les_slices)


        self.n_files = len(self.validation_data["vol"])


    def __getitem__(self, idx):
        #>> Slice of Interest
        V2O  = np.expand_dims(np.expand_dims(np.load(self.validation_data["vol"][idx]),0),0)
        #>> Data Standardization
        V2O = hf.normalize(V2O, supply_mode="orig")

        Les2O = np.expand_dims(np.expand_dims(np.load(self.validation_data["lesmask"][idx]),0),0)
        Liv2O = np.expand_dims(np.expand_dims(np.load(self.validation_data["livmask"][idx]),0),0)
        return_dict = {"vol":V2O[0,:], "lesmask":Les2O[0,:], "livmask":Liv2O[0,:]}

        return return_dict

    def __len__(self):
        return self.n_files

Your code looks fine to me. What kind of data do you use? I’d like to try it myself.
Have you tried replacing your loaded data with random data (just to ensure that the error is not in the loading part)?
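For that test, a stand-in dataset along these lines should do (a sketch only; keys and shapes are copied from your Dataset_Training.__getitem__, so adjust them to whatever your training loop actually indexes):

import numpy as np
from torch.utils.data import Dataset

class Dataset_Random(Dataset):
    # Drop-in replacement for Dataset_Training that skips all file I/O.
    def __init__(self, n_files=2000, crop=(224, 224)):
        self.n_files = n_files
        self.crop    = crop

    def __getitem__(self, idx):
        h, w = self.crop
        return {'vol':     np.random.randn(1, h, w).astype('float32'),
                'lesmask': (np.random.rand(1, h, w) > 0.5).astype('float32'),
                'wmap':    np.random.rand(1, h, w).astype('float32')}

    def __len__(self):
        return self.n_files

If the freeze disappears with this dataset, the problem is in the loading/augmentation path; if it stays, it is pretty clearly on the CUDA side.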