Training freezes after some epochs, no error

eschalck · August 13, 2020, 7:58am

Hello,

I am using images in DICOM format for my project.
To train my model, I implemented a custom dataset, which you'll find below.
Training succeeds when I use images of size 128x128.
However, since I switched to images of size 512x512, training gets stuck after a few epochs without any error message. When I abort the process manually with Ctrl+C, I get the following output:

^C
Aborted!
^CException ignored in: <bound method _MultiProcessingDataLoaderIter.__del__ of <torch.utils.data.dataloader._MultiProcessingDataLoaderIter object at 0x7f4c58bfdb00>>
Traceback (most recent call last):
  File "/home/elsa.schalck/anaconda3/envs/env_kaggle_osic/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 962, in __del__
    self._shutdown_workers()
  File "/home/elsa.schalck/anaconda3/envs/env_kaggle_osic/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 942, in _shutdown_workers
    w.join()
  File "/home/elsa.schalck/anaconda3/envs/env_kaggle_osic/lib/python3.6/multiprocessing/process.py", line 124, in join
    res = self._popen.wait(timeout)
  File "/home/elsa.schalck/anaconda3/envs/env_kaggle_osic/lib/python3.6/multiprocessing/popen_fork.py", line 50, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/home/elsa.schalck/anaconda3/envs/env_kaggle_osic/lib/python3.6/multiprocessing/popen_fork.py", line 28, in poll
    pid, sts = os.waitpid(self.pid, flag)

I tried reducing the number of DataLoader workers from 8 to 0. With 0 workers, training completes, but it takes a very long time.
I also tried reducing the batch size (to 8 and to 2), but the issue persists whenever multiple workers are used.
I monitored GPU memory during training, and it does not seem to be the cause.
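
Since the hang produces no error on its own, a watchdog built on Python's standard `faulthandler` module could show where the main process is blocked without pressing Ctrl+C. This is only a sketch: `run_with_watchdog` is a hypothetical helper name, and it dumps the stacks of the process that calls it, not of the DataLoader worker processes.

```python
import faulthandler
import sys


def run_with_watchdog(fn, timeout_s, log_file=sys.stderr):
    """Run fn(); if it is still running after timeout_s seconds,
    dump the tracebacks of all threads of this process to log_file.

    The process is not killed (exit=False); fn() keeps running.
    """
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=log_file)
    try:
        return fn()
    finally:
        # Disarm the watchdog once fn() has returned or raised.
        faulthandler.cancel_dump_traceback_later()
```

For example, wrapping one epoch as `run_with_watchdog(lambda: train_one_epoch(), 600)` (where `train_one_epoch` stands for whatever function runs the epoch) would print the main-process stack if the epoch takes longer than ten minutes.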

Do you have any explanation for this behavior?
Is there a way to solve it, so that I can use multiple workers?
Thank you!

import os

import numpy as np
import pydicom
from torch.utils.data import Dataset


class DICOM2D_dataset(Dataset):

    def __init__(self, root_dir, patient_df, transform=None, one_img=False, set='train'):

        self.root_dir = root_dir
        self.transform = transform
        self.dir_from_root = '/data/01_raw/osic-pulmonary-fibrosis-progression/' + set

        slices = []
        for series in patient_df['Patient'].tolist():
            series_dir = os.path.join(root_dir + self.dir_from_root, series)
            if not one_img:
                # keep every slice of the series
                for filename in os.listdir(series_dir):
                    slices.append(series + '/' + filename)
            else:
                # keep only the first listed slice (os.listdir order is arbitrary)
                filename = os.listdir(series_dir)[0]
                slices.append(series + '/' + filename)

        self.slices = slices

    def get_img_hu(self, dicom):

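        # (0028,1052) Rescale Intercept and (0028,1053) Rescale Slope
        intercept = dicom[0x0028, 0x1052].value
        slope = dicom[0x0028, 0x1053].value

        # convert stored pixel values to Hounsfield units
        image = dicom.pixel_array
        image = (image * slope + intercept).astype(np.int16)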

        return image

    def __len__(self):
        return len(self.slices)

    def __getitem__(self, idx):

        # relative path "<patient>/<file>"; avoids shadowing the built-in `slice`
        slice_path = self.slices[idx]
        path_dicom = os.path.join(self.root_dir + self.dir_from_root, slice_path)
        d = pydicom.dcmread(path_dicom)

        # get image data in hounsfield unit
        image = self.get_img_hu(d)

        # add channel
        image = np.expand_dims(image, 0)

        # apply transformation
        if self.transform:
            image = self.transform(image)

        return image, path_dicom
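
To rule out a single slow or unreadable file, the dataset could also be indexed directly in the main process while timing each access. The helper below is a minimal sketch: `time_items` is a hypothetical name, and it only assumes that the dataset supports `dataset[i]`.

```python
import time


def time_items(dataset, indices):
    """Load dataset[i] for each i in the main process and time each access.

    Returns a list of (index, seconds) pairs, slowest first.
    """
    timings = []
    for i in indices:
        start = time.perf_counter()
        _ = dataset[i]  # calls __getitem__, including file read and transform
        timings.append((i, time.perf_counter() - start))
    return sorted(timings, key=lambda pair: pair[1], reverse=True)
```

Calling `time_items(dataset, range(len(dataset)))[:10]` would list the ten slowest samples; an exception raised here would also point to the sample that fails to load.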