Data loading index overflow in dataloader

I am using a custom COCO-style dataset for training, but I run into an index overflow issue during training, as shown below:

Traceback (most recent call last):
      File ".../engine/train_loop.py", line 134, in train
        self.run_step()
      File ".../engine/defaults.py", line 429, in run_step
        self._trainer.run_step()
      File ".../engine/train_loop.py", line 222, in run_step
        data = next(self._data_loader_iter)
      File ".../data/common.py", line 179, in __iter__
        for d in self.dataset:
      File ".../python3.9/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
        data = self._next_data()
      File ".../python3.9/site-packages/torch/utils/data/dataloader.py", line 1333, in _next_data
        return self._process_data(data)
      File ".../python3.9/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
        data.reraise()
      File ".../python3.9/site-packages/torch/_utils.py", line 543, in reraise
        raise exception
    IndexError: Caught IndexError in DataLoader worker process 0.
    Original Traceback (most recent call last):
      File ".../python3.9/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
        data = fetcher.fetch(index)
      File ".../python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
        data = [self.dataset[idx] for idx in possibly_batched_index]
      File ".../python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
        data = [self.dataset[idx] for idx in possibly_batched_index]
      File ".../data/common.py", line 43, in __getitem__
        data = self._map_func(self._dataset[cur_idx])
      File ".../data/common.py", line 107, in __getitem__
        start_addr = 0 if idx == 0 else self._addr[idx - 1].item()
    IndexError: index 102724527 is out of bounds for axis 0 with size 15000

The reported index differs between failures, for example:

IndexError: index 161312927 is out of bounds for axis 0 with size 15000

I traced the issue to the data index loading, but I am not sure why it happens.
Interestingly, when I set num_workers to 1 the program runs without any error, but as soon as I set num_workers to a value greater than 1, the error is raised during training.


The index passed to Dataset.__getitem__ is created by the sampler. Assuming you are not using a custom sampler, the valid indices are in the range [0, len(dataset) - 1], which corresponds to the number of samples in the dataset. How many samples do you expect your current dataset to contain, and does the failing index fit into that range?
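
For reference, here is a minimal sketch of that check, using a toy stand-in for the real COCO-style dataset (ToyDataset and its 15000-sample size are placeholders, not the original code):

import torch
from torch.utils.data import Dataset, RandomSampler

# Placeholder dataset; only the length and the index check matter here.
class ToyDataset(Dataset):
    def __init__(self, num_samples=15000):
        self.num_samples = num_samples

    def __getitem__(self, index):
        # Fail loudly if a sampler ever hands over an out-of-range index.
        assert 0 <= index < self.num_samples, f"bad index {index}"
        return torch.tensor(index)

    def __len__(self):
        return self.num_samples

dataset = ToyDataset()
print("len(dataset) =", len(dataset))

# The default samplers only emit indices in [0, len(dataset) - 1].
indices = list(RandomSampler(dataset))
print("min/max sampled index:", min(indices), max(indices))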

Thank you for your reply. But isn't it strange that the index overflow does not occur when num_workers is set to 1?

Yes, that would be strange, but I would still recommend checking the length etc., unless you can provide a minimal and executable code snippet showing that the indexing error is only raised in a multiprocessing setup.
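
A minimal skeleton for such a snippet could look like the following (DummyDataset is a hypothetical placeholder; a real reproduction would need to plug in the failing custom COCO-style dataset):

import torch
from torch.utils.data import Dataset, DataLoader

class DummyDataset(Dataset):
    def __init__(self, num_samples=100):
        self.data = torch.randn(num_samples, 3)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.data.size(0)

def run(num_workers):
    # Same loader configuration, only the number of workers changes.
    loader = DataLoader(DummyDataset(), batch_size=4, num_workers=num_workers)
    try:
        for _ in loader:
            pass
        print(f"num_workers={num_workers}: no error")
    except IndexError as e:
        print(f"num_workers={num_workers}: IndexError: {e}")

if __name__ == "__main__":  # guard needed when worker processes are spawned
    run(0)
    run(2)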

The error message “IndexError: index 161312927 is out of bounds for axis 0 with size 15000” shows that the loader sees 15000 images in the dataset, and my custom dataset does indeed contain 15000 images.


No, the error message points to the indexing error and the size of the tensor being indexed; it does not tell you anything about the Dataset or its length.
Here is a small example that raises an error due to an invalid __len__:

import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    def __init__(self):
        self.data = torch.randn(10, 100)

    def __getitem__(self, index):
        x = self.data[index]
        return x

    def __len__(self):
        # WRONG since size(0) should be used
        return self.data.size(1)

dataset = MyDataset()
loader = DataLoader(dataset, batch_size=1)

for x in loader:
    print(x)
#     x = self.data[index]
# IndexError: index 10 is out of bounds for dimension 0 with size 10
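
For completeness, a sketch of the corrected version (not from the original post), where __len__ returns the size of the sample dimension and the loop finishes without an error:

import torch
from torch.utils.data import Dataset, DataLoader

class MyFixedDataset(Dataset):
    def __init__(self):
        self.data = torch.randn(10, 100)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        # Correct: dimension 0 holds the samples, so len(dataset) == 10
        return self.data.size(0)

loader = DataLoader(MyFixedDataset(), batch_size=1)
print(sum(1 for _ in loader))  # prints 10, no IndexError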

It seems you might have other ideas on how to debug the issue, so keep us updated! :slight_smile:

is_Juncheng, did you have any luck with this issue?