[DataLoader] shuffle=True makes IO slow

Thanks everyone.
My dataset contains 15 million images. I have converted them into LMDB format and concatenated them.
At first I set shuffle=False, and every iteration's IO took no extra time.
In order to improve performance, I set it to True and used num_workers.

from datetime import datetime

from torch.utils.data import ConcatDataset, DataLoader

# train_data_1 / train_data_2 are the two LMDB-backed datasets; epochs is set elsewhere
train_data = ConcatDataset([train_data_1, train_data_2])
train_loader = DataLoader(dataset=train_data, batch_size=64, num_workers=32,
                          shuffle=True, pin_memory=False)
for i in range(epochs):
    for j, data in enumerate(train_loader):
        print(i, "ITER", j, "IO END", datetime.now())

But now the IO takes too much time per iteration.
Is there something I can do to make the IO faster?

Hi,
Have you evaluated the behaviour with num_workers=0 in both cases?
Is it possible your data is optimized for sequential reading (I'm not familiar with the LMDB format)?
Setting shuffle=True does nothing but replace the sequential index generator range(0, len) with a list of random indices.
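To make that concrete, here is a minimal sketch (with a dummy dataset standing in for your LMDB-backed one; none of these names come from your post) showing that shuffle=True is just shorthand for using a RandomSampler instead of a SequentialSampler, so the only thing that changes is the order of indices handed to __getitem__:

from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler

class DummyDataset(Dataset):
    # Stand-in for the LMDB-backed dataset; __getitem__ just echoes the index it receives.
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return idx

ds = DummyDataset()

# shuffle=False is equivalent to an explicit SequentialSampler:
# indices arrive as 0, 1, 2, ... so the storage sees sequential reads.
seq_loader = DataLoader(ds, batch_size=4, sampler=SequentialSampler(ds))

# shuffle=True is equivalent to an explicit RandomSampler:
# indices arrive in random order, so the storage sees random reads.
rand_loader = DataLoader(ds, batch_size=4, sampler=RandomSampler(ds))

print([batch.tolist() for batch in seq_loader])   # [[0, 1, 2, 3], [4, 5, 6, 7]]
print([batch.tolist() for batch in rand_loader])  # e.g. [[5, 0, 7, 2], [3, 6, 1, 4]]

On storage that handles sequential reads much faster than random reads, that change in access pattern alone can explain the slowdown.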

Thanks very much.
Setting num_workers=0 does not help.
It seems that the cloud server's storage was causing the problem. I put the dataset on the SSD of my local machine and that solved it.
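For anyone hitting the same issue, here is a rough sketch (the LMDB path and probe size are assumptions, not taken from this thread) for timing reads in stored order versus random order directly against the LMDB file, which can show whether random access on the storage backend is the real bottleneck:

import random
import time

import lmdb

# Path is hypothetical; point it at your own LMDB file.
env = lmdb.open("path/to/train.lmdb", readonly=True, lock=False, readahead=False)

with env.begin() as txn:
    # Collect keys only; values are not read at this point.
    keys = list(txn.cursor().iternext(values=False))
    probe = min(1000, len(keys))

    t0 = time.perf_counter()
    for k in keys[:probe]:                 # stored (sequential) order
        txn.get(k)
    t_seq = time.perf_counter() - t0

    t0 = time.perf_counter()
    for k in random.sample(keys, probe):   # random order, like shuffle=True produces
        txn.get(k)
    t_rand = time.perf_counter() - t0

print(f"sequential: {t_seq:.3f}s  random: {t_rand:.3f}s")

A large gap between the two numbers on the cloud server but not on a local SSD would point at the storage rather than at the DataLoader.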