DataLoader hangs with custom DataSet

Hello,

I'm facing a problem with the DataLoader and a custom Dataset.

Here is my custom Dataset:

from os import listdir
from os.path import isdir, isfile, join

import torch as th
from torch.utils.data import Dataset
from tqdm import tqdm


class AudioDataset(Dataset):

    def __init__(self, dataset_path: str) -> None:
        super().__init__()

        assert isdir(dataset_path)

        all_magn = [
            f for f in tqdm(listdir(dataset_path))
            if isfile(join(dataset_path, f)) and
               f.startswith("magn")
        ]

        all_phase = [
            f for f in tqdm(listdir(dataset_path))
            if isfile(join(dataset_path, f)) and
               f.startswith("phase")
        ]

        assert len(all_magn) == len(all_phase)

        self.__all_magn = sorted(all_magn)
        self.__all_phase = sorted(all_phase)

        self.__dataset_path = dataset_path

    def __getitem__(self, index: int):
        magn = th.load(join(
            self.__dataset_path,
            self.__all_magn[index]
        ))

        phase = th.load(join(
            self.__dataset_path,
            self.__all_phase[index]
        ))

        return th.stack([magn, phase], dim=0)

    def __len__(self):
        return len(self.__all_magn)

which is loaded with:

from torch.utils.data import DataLoader

import audio  # the module containing the AudioDataset above

if __name__ == "__main__":
    audio_dataset = audio.AudioDataset("/path/to/tensor/dir")

    data_loader = DataLoader(
        audio_dataset,
        batch_size=8,
        shuffle=True,
        num_workers=10,
        drop_last=True
    )

The data itself loads correctly, but the DataLoader hangs intermittently while iterating. The loading speed does not seem constant (my dataset is 60k+ tensors of size (512, 512)): an epoch takes anywhere from 20 min to 1 h.

Note that the iteration speed is constant when I set num_workers=0.

I've seen that this issue is quite common. How can I remediate these hangs?

Python: 3.6 and 3.8 (problem with both)
PyTorch: 1.9.0
CUDA: 11.1
NVIDIA driver: 460.84
Ubuntu: 20.04

Best regards

You could profile the DataLoader (with num_workers>0) and check if you are seeing spikes in the data loading time. If so, that would point towards a data loading bottleneck, which would cause the training loop to wait for the next available batch.
This post explains common bottlenecks and proposes some workarounds, in case you are indeed seeing this issue.
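A minimal way to check for such spikes is to time how long each iteration blocks on the loader. The sketch below uses a random stand-in dataset (`RandomDataset` is hypothetical, only there so the snippet runs on its own); you would substitute your `AudioDataset` to profile the real pipeline:

```python
import time

import torch as th
from torch.utils.data import DataLoader, Dataset


class RandomDataset(Dataset):
    """Stand-in that mimics one (2, 512, 512) sample per index."""

    def __len__(self):
        return 64

    def __getitem__(self, index):
        return th.rand(2, 512, 512)


if __name__ == "__main__":
    loader = DataLoader(RandomDataset(), batch_size=8, num_workers=2)

    wait_times = []
    end = time.perf_counter()
    for batch in loader:
        # Time spent waiting for the workers to deliver this batch.
        wait_times.append(time.perf_counter() - end)
        # ... training step would go here ...
        end = time.perf_counter()

    print(f"mean wait: {sum(wait_times) / len(wait_times):.4f}s, "
          f"max wait: {max(wait_times):.4f}s")
```

Large, irregular wait times with num_workers>0 (and steady ones with num_workers=0) would confirm that the workers cannot keep up, e.g. because of slow disk reads of the 512×512 tensor files.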