In DataLoader, num_workers > 1 shuffles data even when shuffle=False

I am observing this phenomenon in PyTorch 0.4.0. Setting num_workers > 1 together with shuffle=False still appears to shuffle the data. I think this is counterintuitive for PyTorch users, and changes should be made so that num_workers does not shuffle the data.

I don’t think it shuffles data. Where did you see this?

Code snippet to reproduce the issue.

import torch
import torchvision
import torchvision.transforms as transforms

def load_data():
    # The Lambda only prints each image's min and max; print() returns None,
    # so the transform output is None and the default collate_fn will fail.
    # The prints themselves are what reveal the processing order.
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Lambda(lambda z: print(z.min(), z.max()))
    ])

    dataset = torchvision.datasets.ImageFolder('/home/ssd0/images/', transform=transform)
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=False, num_workers=4)

    return dataloader

for data in load_data():
    pass

The above code ends in an error because transforms.Lambda(lambda z: print(z.min(), z.max())) returns NoneType. But if you look carefully at the prints before the error, they show the min and max values of each of the 16 images in the first batch, and the order keeps changing every time I run the code. Setting num_workers=1 resolves this problem. I suspect the 4 parallel workers are racing against each other. Please tell me if I am doing something wrong.
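For reference, here is a stripped-down sketch of the same experiment that does not need an image folder. The IndexDataset class is made up purely for illustration; it just prints which sample each worker is loading:

import torch
from torch.utils.data import Dataset, DataLoader

class IndexDataset(Dataset):
    """Toy dataset that returns its own index; the print shows which
    sample is being loaded and in what order."""
    def __len__(self):
        return 64

    def __getitem__(self, idx):
        print('loading sample', idx)
        return idx

loader = DataLoader(IndexDataset(), batch_size=16, shuffle=False, num_workers=4)
for batch in loader:
    pass

With num_workers=4 the print order can interleave across runs, the same way the min/max prints above do.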


I faced exactly the same problem and ran the same experiment as you did. It showed that the workers co-operate on the same batch. shuffle=False means that the batch order (the big data chunks) is stable; however, the data points within a batch appear in a randomized order because of the several worker subprocesses. You can set num_workers=0, which means that only the main process handles the loading. That way, both the batch order and the order of the data within each batch are stable.
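If you want to check whether the loader's output order actually differs (rather than just the print order), a quick sketch like the one below can help. The IndexDataset here is again a made-up toy dataset, not part of the original report:

import torch
from torch.utils.data import Dataset, DataLoader

class IndexDataset(Dataset):
    # Toy dataset returning its own index so ordering is easy to verify.
    def __len__(self):
        return 64

    def __getitem__(self, idx):
        return idx

def collect(num_workers):
    # Gather everything the loader yields into a single tensor.
    loader = DataLoader(IndexDataset(), batch_size=16, shuffle=False,
                        num_workers=num_workers)
    return torch.cat(list(loader))

# If the two runs match, the loader returns samples in the same order
# regardless of the number of workers.
print(torch.equal(collect(0), collect(4)))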
