DataLoader is very slow when the dataset is large and shuffle=True

I have a large image dataset, and I found that the DataLoader is very slow. After running several tests, I found the following when the number of images is large:

  1. Reading the dataset directly is fast (this is the baseline)
  2. shuffle=False with num_workers=0 is also fast (about 1.1 times the baseline time)
  3. shuffle=False with num_workers=8 becomes slow (about 2.9 times the baseline time)
  4. shuffle=True with num_workers=0 is slower (about 7.2 times the baseline time)
  5. shuffle=True with num_workers=8 is the slowest (about 9.2 times the baseline time)

Here is the sample code:

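import time

import torch

dataset = MyTestDataset()  # my own Dataset class

# Baseline: read 128 images directly from the dataset, one every 1000 indices
since = time.time()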
imgs = []
for idx in range(128):
    img,label = dataset[10000+idx*1000]
    imgs.append(img)
base = time.time() - since
print("next dataset",base)

loader = torch.utils.data.DataLoader(dataset, batch_size=128, shuffle=False, num_workers=0)
since = time.time()
imgs = next(iter(loader))
cost = time.time() - since
print("loader no shuffle num_workers-0",cost,cost/base)

loader = torch.utils.data.DataLoader(dataset, batch_size=128, shuffle=False, num_workers=8)
since = time.time()
imgs = next(iter(loader))
cost = time.time() - since
print("loader no shuffle num_workers-8",cost,cost/base)

loader = torch.utils.data.DataLoader(dataset, batch_size=128, shuffle=True, num_workers=0)
since = time.time()
imgs = next(iter(loader))
cost = time.time() - since
print("loader shuffle num_workers-0",cost,cost/base)

dataset = MyTestDataset()
loader = torch.utils.data.DataLoader(dataset, batch_size=128, shuffle=True, num_workers=8)
since = time.time()
imgs = next(iter(loader))
cost = time.time() - since
print("loader shuffle num_workers-8",cost,cost/base)

And below is the output (label, seconds, ratio to the baseline):

next dataset 0.2839939594268799
loader no shuffle num_workers-0 0.3133578300476074 1.103396109832709
loader no shuffle num_workers-8 0.811976432800293 2.859132759157693
loader shuffle num_workers-0 2.041795253753662 7.189572827091643
loader shuffle num_workers-8 2.617912769317627 9.218198776483705
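
Each loader timing above covers both `iter(loader)` and the first `next(...)`. With num_workers > 0, creating the iterator is also where the worker processes get started, so some of the num_workers=8 cost may be startup rather than data loading. A minimal sketch for timing the two steps separately follows; `split_timing` is a hypothetical helper name, and the list of batches is only a stand-in so the snippet runs on its own.

```python
import time


def split_timing(iterable):
    """Return (iterator setup seconds, first-item fetch seconds, first item)."""
    start = time.perf_counter()
    it = iter(iterable)  # for a DataLoader with num_workers > 0, workers start here
    setup = time.perf_counter() - start

    start = time.perf_counter()
    first = next(it)  # fetches the first batch
    fetch = time.perf_counter() - start
    return setup, fetch, first


# Stand-in iterable (an assumption for illustration); pass a DataLoader instead.
batches = [[0, 1, 2], [3, 4, 5]]
setup, fetch, first = split_timing(batches)
print(f"iterator setup {setup:.6f}s, first batch {fetch:.6f}s")
```

Calling `split_timing(loader)` on each of the four loaders from the sample code should show how much of each ratio comes from setup and how much from fetching the batch itself.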

So it seems that shuffle=True makes loading 7 to 9 times slower?
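
One thing the test does not control for is the access pattern: the baseline loop reads indices 10000, 11000, ..., the shuffle=False loaders read indices 0 to 127, and the shuffle=True loaders read 128 random indices from the whole dataset. A rough sketch for checking whether random access alone is slow, with no DataLoader involved, is shown below. `time_direct_reads` is a hypothetical helper, and the list used as `dataset` is only a placeholder that should be swapped for `MyTestDataset()`.

```python
import random
import time


def time_direct_reads(dataset, indices):
    """Time dataset[i] for each i in indices, with no DataLoader involved."""
    start = time.perf_counter()
    for i in indices:
        dataset[i]
    return time.perf_counter() - start


# Placeholder dataset (an assumption); substitute MyTestDataset() here.
dataset = list(range(200_000))
n = len(dataset)

patterns = {
    # First batch read when shuffle=False
    "sequential": list(range(128)),
    # Same indices as the baseline loop in the sample code
    "strided": [10000 + idx * 1000 for idx in range(128)],
    # Roughly what the first batch looks like when shuffle=True
    "random": random.sample(range(n), 128),
}
timings = {name: time_direct_reads(dataset, idx) for name, idx in patterns.items()}
for name, seconds in timings.items():
    print(f"direct read, {name} indices: {seconds:.6f}s")
```

If the random pattern is already several times slower than the other two on the real dataset, the slowdown would point at how the dataset handles random reads rather than at the shuffle logic in the DataLoader.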