Training freezes when using DataLoader with num_workers > 0


I am doing a grid search over many different hyperparameters. I wrote a script that generates all combinations of hyperparameters and then starts one thread per GPU (the machine has 4 GPUs, so I use 4 threads); each thread trains a model. A queue holds all hyperparameter configurations, and each thread pulls its next configuration from this queue.

Each thread is doing these steps:

  • Read training and validation samples from h5 file
  • Initialize DataLoaders that also do some transformations (RandomFlip, Normalization, etc.)
  • Train for N epochs and validate after each epoch
  • Save results to a file
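
For reference, the setup described above can be sketched roughly as follows. This is a minimal, standard-library-only sketch and not the actual script: `make_grid`, `train_model`, `worker` and `run_grid_search` are hypothetical names, and `train_model` is only a stub standing in for the h5 loading, DataLoader setup and training loop.

```python
import itertools
import queue
import threading


def make_grid(space):
    """Return every combination of the values in `space` as a list of dicts."""
    keys = sorted(space)
    return [dict(zip(keys, values))
            for values in itertools.product(*(space[k] for k in keys))]


def train_model(config, gpu_id):
    """Hypothetical stub for the real work (read h5 data, build DataLoaders,
    train for N epochs, validate). It only echoes its inputs."""
    return {"gpu": gpu_id, "config": config}


def worker(gpu_id, configs, results):
    """Pull configurations until the queue is empty; one worker per GPU."""
    while True:
        try:
            config = configs.get_nowait()
        except queue.Empty:
            return
        results.put(train_model(config, gpu_id))


def run_grid_search(space, num_gpus=4):
    """Run the grid search with one thread per GPU, all in a single process."""
    configs = queue.Queue()
    for config in make_grid(space):
        configs.put(config)
    results = queue.Queue()
    threads = [threading.Thread(target=worker, args=(gpu_id, configs, results))
               for gpu_id in range(num_gpus)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # All threads have finished, so the queue size is stable here.
    return [results.get() for _ in range(results.qsize())]
```

Note that all four workers share one Python process in this layout, so any DataLoader worker processes are started from a multi-threaded parent.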

However, there is a strange bug that causes my script to freeze randomly: sometimes in the first epoch, sometimes later, and always on a different batch within the epoch. nvidia-smi shows that the GPU of the frozen thread still holds memory but sits at 0% utilization.

| NVIDIA-SMI 390.67                 Driver Version: 390.67                    |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  GeForce GTX TIT…  Off    | 00000000:02:00.0 Off |                  N/A |
| 22%   26C    P8    15W / 250W |    547MiB / 12212MiB |      0%      Default |
|   1  GeForce GTX TIT…  Off    | 00000000:04:00.0 Off |                  N/A |
| 22%   27C    P8    14W / 250W |    582MiB / 12212MiB |      0%      Default |
|   2  GeForce GTX TIT…  Off    | 00000000:83:00.0 Off |                  N/A |
| 22%   32C    P8    14W / 250W |    563MiB / 12212MiB |      0%      Default |
|   3  GeForce GTX TIT…  Off    | 00000000:84:00.0 Off |                  N/A |
| 31%   70C    P2   122W / 250W |   3411MiB / 12212MiB |     60%      Default |

| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|    0     19757      C   python                                       526MiB |
|    1     19757      C   python                                       561MiB |
|    2     19757      C   python                                       542MiB |
|    3     19757      C   python                                      3390MiB |

I have added many print statements to find where it freezes, and my guess is that the ToPILImage() transformation is the problem. The error only occurs when I use num_workers > 0 in my DataLoaders.

I have already seen a few bug reports that had a similar problem when using cv2 in their Datasets (that are used by DataLoader). But I am not using cv2, only torch, torchvision and (indirectly) PIL / pillow. Is there a known issue when using DataLoader within a thread? Is it not possible to run multiple training scripts that all use DataLoaders in parallel?

When I use num_workers = 0, it works, but then it is quite slow.

Here is some relevant code:

        jitterTransform = transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1)
        normMean = tuple(np.array([126.55891472,  99.32252242,  97.26477874])/255.0)
        normStd = tuple(np.array([65.72192038, 56.78839458, 59.06476169])/255.0)
        normTransform = transforms.Normalize(normMean, normStd)
        trainTransform = transforms.Compose([

        testTransform = transforms.Compose([

        with h5py.File(, 'r') as f:
            train_dataset = MyDataset(f['train'], transform=trainTransform, flip=True, trans=normTransform, gpu=self.gpu_id)
            val_dataset = MyDataset(f['test'], transform=testTransform, gpu=self.gpu_id)

        dataloader_params = {
            'batch_size' : config['batch_size'],
            'pin_memory' : True,
            'num_workers' : 32 # High value => Freezes faster
        }

        train_loader = DataLoader(train_dataset, shuffle=True, **dataloader_params)
        val_loader = DataLoader(val_dataset, shuffle=False, **dataloader_params)

and MyDataset:

from torch.utils.data import Dataset
class MyDataset(Dataset):
    def __init__(self, f, transform=None, flip=False, trans=None, gpu=None):
        self.images, self.coords = f['images'][:], f['coords'][:]
        if transform is None:
            self.images = np.transpose(self.images, (0, 3, 1, 2))   # swap from B x H x W x C to B x C x H x W
        self.transform = transform
        self.maybe_flip = (lambda x: T(x)) if flip else (lambda x: x)

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        if self.transform:
            print("before transform")
            image = self.transform(image)
            print("after transform")
        sample = {'image': image, 'landmarks': landmark}

        return self.maybe_flip(sample)

In the cases where it freezes, I see only the “before transform” output but not the “after transform” output.

Is there a known issue with race conditions, deadlocks, etc. when using ToPILImage in multiple threads in parallel?

EDIT: As soon as 3 of my 4 GPU threads freeze, the last one continues running without any problem. This suggests that my code in general is not wrong, but that there are some strange side effects when it runs in more than one thread.

EDIT 2: Even when I only use this transform and num_workers > 0, it still freezes:


So it has nothing to do with PIL? I am completely confused now. It looks like there is a bug in the DataLoader itself when it is used in multiple threads, even if every thread has its own DataLoader.

Moving this from vision to the uncategorized topic since I think PIL is not involved.

Maybe this should be a bug report instead of a forum entry:

I guess it could help to use real processes instead of threads. Currently each GPU trains in a separate thread, but everything runs in one process. Will try that tomorrow.

Yes, this solved the problem. I will update the bug report with an explanation:

@PySimon can you share the code and the step-by-step process you followed to fix this issue?
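
The code of the actual fix is not shown in this thread. As a rough illustration of the idea mentioned above (real processes instead of threads), here is a minimal sketch that starts one OS process per GPU and restricts each process to its GPU through the `CUDA_VISIBLE_DEVICES` environment variable. Everything named here is an assumption for illustration: `run_in_processes` and `WORKER_SNIPPET` are hypothetical, the inline snippet is only a stand-in for a real training script, and the configurations are split statically instead of being pulled from a shared queue.

```python
import json
import os
import subprocess
import sys

# Hypothetical stand-in for a real training script. It receives its share of
# the configurations as a JSON string in argv[1] and reports the GPU it sees.
WORKER_SNIPPET = (
    "import json, os, sys\n"
    "configs = json.loads(sys.argv[1])\n"
    "gpu = os.environ['CUDA_VISIBLE_DEVICES']\n"
    "print(json.dumps([{'gpu': gpu, 'config': c} for c in configs]))\n"
)


def run_in_processes(configs, num_gpus=4, worker_cmd=None):
    """Split `configs` round-robin and run one OS process per GPU."""
    worker_cmd = worker_cmd or [sys.executable, "-c", WORKER_SNIPPET]
    procs = []
    for gpu_id in range(num_gpus):
        share = configs[gpu_id::num_gpus]
        if not share:
            continue
        # Each process only sees its own GPU, which it addresses as device 0.
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
        procs.append(subprocess.Popen(worker_cmd + [json.dumps(share)],
                                      env=env, stdout=subprocess.PIPE,
                                      text=True))
    results = []
    for proc in procs:
        # All processes are already running; this only collects their output.
        out, _ = proc.communicate()
        if proc.returncode != 0:
            raise RuntimeError(f"worker exited with code {proc.returncode}")
        results.extend(json.loads(out))
    return results
```

In a real setup, `worker_cmd` would point at the training script, which would then create its own DataLoaders (with num_workers > 0) inside its own single-threaded main process.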