I am observing this phenomenon in PyTorch 0.4.0. Setting num_workers > 1 with shuffle = False is shuffling the data. I think this is counterintuitive to the users of PyTorch, and changes should be made so that num_workers does not shuffle the data.
I don't think it shuffles data. Where did you see this?
Here is a code snippet to reproduce the issue:
import torch
import torchvision
import torchvision.transforms as transforms

def load_data():
    transform = transforms.Compose([
        transforms.ToTensor(),
        # Prints each image's min and max; note the lambda returns None,
        # which is what triggers the error described below.
        transforms.Lambda(lambda z: print(z.min(), z.max()))
    ])
    dataset = torchvision.datasets.ImageFolder('/home/ssd0/images/', transform=transform)
    dataloader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=False, num_workers=4)
    return dataloader

for data in load_data():
    pass
The above code ends in an error because transforms.Lambda(lambda z: print(z.min(), z.max())) returns NoneType. But if you look carefully at the prints, it displays 16 torch tensors, the min and max values of each image in the first batch, and their order changes every time I run the code. Setting num_workers=1 resolves this problem. I suspect the 4 parallel workers are racing against each other. Please tell me if I am doing something wrong.
I faced exactly the same problem and ran the same experiment as you did. It showed that the workers cooperate within the same batch. shuffle=False means that the batch order (the big data chunks) is stable; however, the data points within a batch are returned in a nondeterministic order by the several subprocesses. You can set num_workers=0, which means that only the main process handles the loading. That way, both the batch order and the order of data within each batch are stable.
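For reference, here is a sketch of the same pipeline with num_workers=0 and with the Lambda changed to return the tensor after printing, so it no longer crashes. The log_and_pass helper is an illustrative name, not part of torchvision:

import torch
import torchvision
import torchvision.transforms as transforms

def log_and_pass(z):
    # Print per-image stats, then return the tensor so collation still
    # receives valid data (a bare print returns None, causing the crash).
    print(z.min(), z.max())
    return z

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Lambda(log_and_pass),
])
dataset = torchvision.datasets.ImageFolder('/home/ssd0/images/', transform=transform)
# num_workers=0: only the main process loads data, so both the batch
# order and the print order are deterministic.
dataloader = torch.utils.data.DataLoader(dataset, batch_size=16, shuffle=False, num_workers=0)

for data in dataloader:
    pass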