How to iterate over VCTK dataset elements?

YK11 · May 3, 2018, 9:44am

Hello!
I am trying to iterate over the VCTK dataset like this:

train_set = datasets.VCTK(root='processed/training.pt', download=True,
                          transform=transforms.PadTrim(max_len=30000))
training_data_loader = DataLoader(dataset=train_set, num_workers=opt.nThreads,
                                  batch_size=opt.batchSize, shuffle=True)

for batch_idx, batch in enumerate(training_data_loader):
    print(batch_idx)
    ........

However, it prints only 0 and shows the following error:

0
Traceback (most recent call last):
  File "main_audio.py", line 153, in <module>
    for batch_idx, batch in enumerate(training_data_loader):
  File "/mnt/home/20140941/.conda/envs/opt_anaconda/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 281, in __next__
    return self._process_next_batch(batch)
  File "/mnt/home/20140941/.conda/envs/opt_anaconda/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 301, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
IndexError: Traceback (most recent call last):
  File "/mnt/home/20140941/.conda/envs/opt_anaconda/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 55, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "build/bdist.linux-x86_64/egg/torchaudio/datasets/vctk.py", line 126, in __getitem__
    audio, target = self.data[index], self.labels[index]
IndexError: tuple index out of range

How can I solve this problem?

dhpollack · May 4, 2018, 8:47am

I assume you are using the torchaudio library. The root should be a folder, not a file. Otherwise, you’ll have to write a custom torch.utils.data.Dataset. Also, did you check that it actually downloads the files? The VCTK dataset is very large and takes a long time to download. You should probably just run the first line in a REPL and see whether the dataset gets downloaded correctly.
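
As a quick way to check the point about root, here is a minimal sketch using only the standard library. The helper name describe_root is made up for illustration and is not part of torchaudio; it only reports whether a path is a directory, a file, or missing before the path is handed to a dataset constructor.

```python
import os


def describe_root(path):
    """Report what kind of filesystem entry `path` is (hypothetical helper)."""
    if os.path.isdir(path):
        return "directory"
    if os.path.isfile(path):
        return "file"
    return "missing"


# The path from the original post looks like a file name rather than a folder,
# so it is worth checking what it actually refers to on disk.
print(describe_root("processed/training.pt"))

# The current working directory, for comparison.
print(describe_root("."))
```

If the first call does not report a directory, that path is probably not what the dataset class expects as its root.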

YK11 · May 6, 2018, 6:18am

Thanks for your reply! I fixed the root and all the data was downloaded. However, there is still a problem.

train_set = datasets.VCTK(root='.', download=True,
                          transform=transforms.PadTrim(max_len=30000))
training_data_loader = DataLoader(dataset=train_set, num_workers=opt.nThreads,
                                  batch_size=2, shuffle=True)

for batch_idx, batch in enumerate(training_data_loader, 0):
    print(batch_idx)

The length of the training set is 44257. When I run the code, it prints integers from 1 to 20 (it is supposed to print up to 22129) and then shows a similar error.

    for batch_idx, batch in enumerate(training_data_loader, 0):
  File "/mnt/home/20140941/.conda/envs/opt_anaconda/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 281, in __next__
    return self._process_next_batch(batch)
  File "/mnt/home/20140941/.conda/envs/opt_anaconda/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 301, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
IndexError: Traceback (most recent call last):
  File "/mnt/home/20140941/.conda/envs/opt_anaconda/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 55, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "build/bdist.linux-x86_64/egg/torchaudio/datasets/vctk.py", line 126, in __getitem__
    audio, target = self.data[index], self.labels[index]
IndexError: tuple index out of range

Please help!
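
One way to narrow this down is to take the DataLoader and the shuffling out of the picture and index the dataset directly, recording which indices raise. The sketch below is only an illustration: ToyDataset is a made-up stand-in whose reported length is larger than the data it holds (one possible way to get an "index out of range" error; whether that is what happens inside torchaudio's VCTK class is not confirmed here), and find_bad_indices and expected_batches are hypothetical helpers. It also shows the batch-count arithmetic for the sizes mentioned above.

```python
class ToyDataset:
    """Hypothetical stand-in: claims more items than it actually stores."""

    def __init__(self, items, claimed_len):
        self.items = tuple(items)
        self.claimed_len = claimed_len

    def __len__(self):
        return self.claimed_len

    def __getitem__(self, index):
        return self.items[index]  # raises IndexError past the stored items


def find_bad_indices(dataset):
    """Try every index in range(len(dataset)); collect those raising IndexError."""
    bad = []
    for i in range(len(dataset)):
        try:
            dataset[i]
        except IndexError:
            bad.append(i)
    return bad


def expected_batches(n_items, batch_size):
    """Batch count when the last, smaller batch is kept (ceiling division)."""
    return (n_items + batch_size - 1) // batch_size


toy = ToyDataset(items=["a", "b", "c"], claimed_len=5)
print(find_bad_indices(toy))        # indices that fail on the toy data
print(expected_batches(44257, 2))   # batch count for the sizes in the post
```

On the real dataset, calling find_bad_indices(train_set) in a single process would show whether the failures are tied to specific indices. Note that with 44257 items and a batch size of 2, enumerate starting at 0 would number the batches 0 to 22128.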

tks · August 13, 2019, 5:08pm

Did you ever get an answer for this? I’m running into the exact same issue now…

It does seem to work as expected, though, when I iterate through the dataset directly rather than through the loader.
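
One possible explanation for that difference, offered as an assumption rather than a confirmed diagnosis: an object that defines __getitem__ but not __iter__ is iterated through Python's fallback sequence protocol, which calls __getitem__ with 0, 1, 2, ... and treats the first IndexError as the end of iteration. A plain for loop over such a dataset can therefore stop early without any error, while indexed access (which is what a loader does) raises. The toy class below is hypothetical and is not torchaudio code; it only demonstrates the Python behavior.

```python
class ShortDataset:
    """Hypothetical dataset: __len__ says 6, but only 4 items are stored."""

    def __init__(self):
        self.data = ("w", "x", "y", "z")

    def __len__(self):
        return 6

    def __getitem__(self, index):
        return self.data[index]  # IndexError once index >= 4


ds = ShortDataset()

# Direct iteration: the IndexError at index 4 silently ends the loop.
seen = [item for item in ds]
print(len(seen), "items seen by direct iteration; len(ds) =", len(ds))

# Indexed access over range(len(ds)): the same IndexError surfaces.
try:
    for i in range(len(ds)):
        ds[i]
except IndexError as exc:
    print("indexed access failed at index", i, "-", exc)
```

Comparing the number of items seen by direct iteration with len() of the real dataset would show whether this applies here.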