DataLoader efficiency with multiple workers


I have noticed that my dataloader gets slower if I add more workers compared to num_workers=0.

My dataset definition is quite simple:

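import accimage
from torch.utils import data

# Assumed base class: torch.utils.data.Dataset
class Dataset(data.Dataset):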
    def __init__(self, file_paths, labels, transform=None):

        self.file_paths = file_paths
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, index):

        path_img = self.file_paths[index]
        image = accimage.Image(path_img)
        target = self.labels[index]

        if self.transform is not None:
            image = self.transform(image)

        return image, target

file_paths and labels can be quite large (more than 1 million entries each). As I understand it, the DataLoader would push the dataset onto a worker every time it is indexed. Could the slowdown come from that overhead?

If this is indeed the reason for the slow performance, how could I work around it? Would it be possible to parallelize only the loading and transforming of an image, so that I send just a path, the label, and the transforms to a worker?

Sorry if this just reveals severe misconceptions about how parallelism is implemented here (or works conceptually in general). My setup is a single CPU with 4 connected GPUs, if that is relevant.


If you are using multiple workers, the Dataset will be copied to each worker process, if I’m not mistaken.
The first iteration therefore includes the cost of these copies as well as the creation of the first batch in each process, which might be slow.
However, the following iterations should be faster.
Are you consistently seeing a slowdown using num_workers>=1 compared to num_workers=0?
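
To check this, you could time the first batch separately from the later ones for each num_workers setting. The snippet below is only a rough sketch: SyntheticImages and time_loader are hypothetical names made up for illustration, and the random tensors stand in for your accimage loading and transforms, so the absolute numbers will not match your setup.

```python
import time

import torch
from torch.utils.data import DataLoader, Dataset


class SyntheticImages(Dataset):
    """Hypothetical stand-in dataset: random tensors instead of image files."""

    def __init__(self, n_items=2000):
        self.n_items = n_items

    def __len__(self):
        return self.n_items

    def __getitem__(self, index):
        # Stand-in for "load image + transform"; swap in the real dataset here.
        return torch.rand(3, 64, 64), index % 10


def time_loader(loader, max_batches=20):
    """Return (seconds for the first batch, mean seconds per later batch)."""
    start = time.perf_counter()
    iterator = iter(loader)  # worker processes are started here
    next(iterator)           # first batch includes the worker start-up cost
    first = time.perf_counter() - start

    start = time.perf_counter()
    n_later = 0
    for _ in iterator:
        n_later += 1
        if n_later >= max_batches:
            break
    rest = (time.perf_counter() - start) / max(n_later, 1)
    return first, rest


if __name__ == "__main__":
    dataset = SyntheticImages()
    for num_workers in (0, 2):
        loader = DataLoader(dataset, batch_size=64, num_workers=num_workers)
        first, rest = time_loader(loader)
        print(f"num_workers={num_workers}: first batch {first:.3f}s, "
              f"later batches {rest:.4f}s each")
```

If only the first batch is slow with workers, the start-up cost is the likely cause. If the later batches are slower too, the per-item work may be so cheap that the inter-process overhead dominates, or the bottleneck is elsewhere (for example the storage).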

Generally, this post is really helpful when it comes to data loading bottlenecks, but your issue seems to be unrelated.
Are you storing the data on a local SSD or somewhere else?

Thank you, I will look into that!