DataLoader is very slow with shuffle=True and normal with shuffle=False

horizon_hardyzl · August 9, 2022, 12:55pm

I’m having an issue while training on a large number of audio files (about 2 million WAVs). The DataLoader becomes very slow when shuffle is True. When I set shuffle to False, the speed is normal. I rewrote collate_func so that it uses torchaudio.load to load the waveforms for each batch of wav_paths.
I wonder if it’s because the WAV files are located in many different folders and it takes time to find each one? What should I do if I want to use shuffle? Thanks!

manudwd · August 9, 2022, 3:30pm

You can try

train_sampler = torch.utils.data.RandomSampler(train_set)

and pass the sampler to the dataloader as

train_loaders = torch.utils.data.DataLoader(
    train_set,
    shuffle=False,          # must stay False when a sampler is passed
    sampler=train_sampler,
    batch_size=train_batch_size,
    num_workers=num_workers,
)

If that is still slow, you can define a custom pseudo-random sampler, using the PyTorch RandomSampler source as a starting point.
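Since shuffle=True already uses a RandomSampler internally, a custom sampler only helps if it changes the access pattern. A minimal sketch of one such pseudo-random sampler follows, assuming that neighbouring dataset indices point to files stored close together (for example, in the same folder). BlockShuffleSampler, block_size and seed are hypothetical names for illustration, not part of PyTorch. It shuffles the order of fixed-size blocks and the indices inside each block, so reads stay mostly local while batches are still mixed.

```python
import random


class BlockShuffleSampler:
    """Yield every dataset index once, shuffled only within fixed-size blocks.

    Hypothetical helper (not a PyTorch class). It relies on the assumption
    that consecutive indices refer to files that are stored near each other.
    """

    def __init__(self, data_len, block_size, seed=0):
        if block_size <= 0:
            raise ValueError("block_size must be positive")
        self.data_len = data_len
        self.block_size = block_size
        self.seed = seed
        self.epoch = 0

    def set_epoch(self, epoch):
        # Call once per epoch to get a different order each time.
        self.epoch = epoch

    def __len__(self):
        return self.data_len

    def __iter__(self):
        rng = random.Random(self.seed + self.epoch)
        # Shuffle the order in which blocks are visited.
        starts = list(range(0, self.data_len, self.block_size))
        rng.shuffle(starts)
        for start in starts:
            # Shuffle the indices inside the current block.
            stop = min(start + self.block_size, self.data_len)
            block = list(range(start, stop))
            rng.shuffle(block)
            yield from block
```

DataLoader accepts any iterable with __len__ as its sampler argument, so this could be passed as sampler=BlockShuffleSampler(len(train_set), block_size=4096) with shuffle left at False. The block size of 4096 is an arbitrary starting value; smaller blocks give more locality and less randomness.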

horizon_hardyzl · August 11, 2022, 7:52am

Thank you very much for your reply! Unfortunately, I tried RandomSampler and the problem still exists.
I then shuffled the WAV list in my wav_path_file manually and set shuffle to False. It’s still slow. So I think the problem is the length of the WAVs. I use torch.nn.utils.rnn.pad_sequence to pad the waveform tensors in collate_func. When shuffle is True, the lengths within a batch vary much more, so pad_sequence has to add more padding and takes more time. Could that be the reason?
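One way to test the padding hypothesis is to measure how much of each padded batch is padding. The sketch below does this in plain Python and also builds length-bucketed batches that keep shuffling while grouping items of similar length. It assumes the length of every WAV (in samples or seconds) is known up front, for example from a manifest file. padding_fraction, bucketed_batches and bucket_size are hypothetical names for illustration.

```python
import random


def padding_fraction(lengths, batch_size):
    """Fraction of elements that are padding when each consecutive batch
    is padded to the length of its longest item."""
    padded_total = 0
    real_total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        padded_total += max(batch) * len(batch)
        real_total += sum(batch)
    return 1.0 - real_total / padded_total


def bucketed_batches(lengths, batch_size, bucket_size, seed=0):
    """Return a shuffled list of index batches with similar item lengths.

    Indices are shuffled, cut into buckets of `bucket_size`, sorted by
    length inside each bucket, split into batches, and the batches are
    shuffled again.
    """
    rng = random.Random(seed)
    indices = list(range(len(lengths)))
    rng.shuffle(indices)
    batches = []
    for start in range(0, len(indices), bucket_size):
        bucket = sorted(indices[start:start + bucket_size],
                        key=lambda i: lengths[i])
        for j in range(0, len(bucket), batch_size):
            batches.append(bucket[j:j + batch_size])
    rng.shuffle(batches)
    return batches
```

If padding_fraction is small for the shuffled order, padding is unlikely to explain the slowdown. If it is large, the list returned by bucketed_batches can be passed to the DataLoader as batch_sampler (in place of batch_size, shuffle and sampler), since batch_sampler accepts any iterable of index lists.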
More details:
I set num_workers = 4 and found that the worker processes go into state D. They seem to be waiting for something.

I tried pressing Ctrl+C to interrupt the processes, and the output shows that they hang while waiting for something I can’t identify.
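On Linux, process state D means uninterruptible sleep, which usually indicates waiting on disk or network I/O. That would be consistent with the file reads, rather than the padding, being the bottleneck. A rough way to check is to time raw reads of the same kind of files in listed order and in random order, outside the DataLoader. This is only a sketch: time_reads is a hypothetical helper, and it reads raw bytes instead of decoding audio.

```python
import random
import time


def time_reads(paths, shuffle, seed=0):
    """Read every file once and return (elapsed_seconds, total_bytes).

    With shuffle=True the files are read in a random order; otherwise
    they are read in the order given.
    """
    order = list(paths)
    if shuffle:
        random.Random(seed).shuffle(order)
    total_bytes = 0
    start = time.perf_counter()
    for path in order:
        with open(path, "rb") as f:
            total_bytes += len(f.read())
    elapsed = time.perf_counter() - start
    return elapsed, total_bytes
```

For a fair comparison, the two runs should use different subsets of a few thousand paths each, because files read once may be served from the operating system's cache the second time. If the random-order run is much slower, the storage access pattern is the likely cause.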

I have searched for similar problems online, but the suggested fixes don’t work. I would appreciate any reply. Thanks!