Weird IndexError during validation

I'm getting this strange error during validation and can't find the same issue reported online. Could some of my batches somehow be empty?

Traceback (most recent call last):
  File "/home/tbhakta/anaconda3/envs/test/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/tbhakta/anaconda3/envs/test/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/tbhakta/internal-geometry/src/train.py", line 199, in <module>
    main(cfg)
  File "/home/tbhakta/internal-geometry/src/train.py", line 187, in main
    **cfg.training.fit,
  File "/home/tbhakta/internal-geometry/src/training/runner.py", line 314, in fit
    verbose=verbose,
  File "/home/tbhakta/anaconda3/envs/test/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/tbhakta/internal-geometry/src/training/runner.py", line 343, in evaluate
    for i, batch in enumerate(dataloader):
  File "/home/tbhakta/anaconda3/envs/test/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/tbhakta/anaconda3/envs/test/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1183, in _next_data
    return self._process_data(data)
  File "/home/tbhakta/anaconda3/envs/test/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/home/tbhakta/anaconda3/envs/test/lib/python3.6/site-packages/torch/_utils.py", line 425, in reraise
    raise self.exc_type(msg)
IndexError: Caught IndexError in DataLoader worker process 12.
Original Traceback (most recent call last):
  File "/home/tbhakta/anaconda3/envs/test/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/tbhakta/anaconda3/envs/test/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/home/tbhakta/internal-geometry/src/train.py", line 39, in my_collate
    return torch.utils.data.dataloader.default_collate(batch)
  File "/home/tbhakta/anaconda3/envs/test/lib/python3.6/site-packages/torch/utils/data/_utils/collate.py", line 46, in default_collate
    elem = batch[0]
IndexError: list index out of range

Based on the error message, it seems the collate_fn fails to index the returned batch samples.
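For context: `default_collate` starts with `elem = batch[0]` and has no empty-list guard, so an empty batch raises exactly this `IndexError`. A minimal defensive sketch (assuming your `Dataset.__getitem__` may return `None` for bad samples; `safe_collate` is a made-up name, not part of your code):

```python
def safe_collate(batch, collate=None):
    # Drop samples that __getitem__ may have returned as None
    # (a hypothetical convention -- adapt to however your Dataset
    # flags invalid samples, if it does at all).
    batch = [sample for sample in batch if sample is not None]
    if not batch:
        # default_collate would raise IndexError on an empty list,
        # so signal "skip this batch" to the caller instead.
        return None
    if collate is None:
        # Import lazily so the guard itself stays framework-agnostic.
        from torch.utils.data.dataloader import default_collate
        collate = default_collate
    return collate(batch)
```

The training loop would then `continue` on a `None` batch. Note this only masks the symptom; a worker handing an empty list to collate usually points at a deeper sampler/dataset mismatch.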
Are you able to iterate the DataLoader for an entire epoch without any model training? If not, could you use num_workers=0 and retry it?
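To make that check concrete, a minimal sketch that iterates a loader for one epoch without touching the model and reports where it dies (`exhaust` is a made-up helper name):

```python
def exhaust(loader):
    """Iterate one full epoch over `loader` without any training,
    counting batches so a crash can be localized."""
    n = 0
    try:
        for _batch in loader:
            n += 1
    except Exception as exc:
        # Report how far iteration got before re-raising.
        print("iteration failed after {} batches: {!r}".format(n, exc))
        raise
    return n

# e.g. exhaust(valid_dataloader), ideally with num_workers=0 so the
# real traceback comes from the main process rather than a worker.
```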

Training works fine for a full epoch. The loss spike issue I mentioned earlier was unrelated, and I've since resolved it. Validation after epoch 1 is where I'm seeing the error, which is what's so perplexing.

Here is some of my training config:

  # loaders
  train_dataloader:
    batch_size: 40
    shuffle: true
    drop_last: true
    pin_memory: true
    num_workers: 8

  valid_dataloader:
    batch_size: 1
    shuffle: false
    drop_last: false
    pin_memory: true
    num_workers: 16
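In case it matters, these blocks map straight onto `torch.utils.data.DataLoader` keyword arguments; I build the loaders roughly like this (names simplified, dataset construction omitted):

```python
valid_cfg = {  # mirrors the valid_dataloader block above
    "batch_size": 1,
    "shuffle": False,
    "drop_last": False,
    "pin_memory": True,
    "num_workers": 16,
}
# valid_loader = torch.utils.data.DataLoader(valid_dataset, **valid_cfg)
# Setting num_workers to 0 here runs loading in the main process,
# which surfaces a worker's real traceback directly.
```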

Here's roughly where it fails during validation:

valid:  17%|█▋        | 2987/17897 [02:42<13:04, 18.99it/s, val_loss_mask: 0.1158, val_loss: 0.1158, val_mask_micro_iou: 0.7913]
valid:  17%|█▋        | 2988/17897 [02:42<13:00, 19.10it/s, val_loss_mask: 0.1158, val_loss: 0.1158, val_mask_micro_iou: 0.7913]
valid:  17%|█▋        | 2988/17897 [02:42<13:31, 18.38it/s, val_loss_mask: 0.1158, val_loss: 0.1158, val_mask_micro_iou: 0.7913]

No error with num_workers set to 0 for the valid_dataloader. Strange.

@ptrblck does this mean I should just stick with num_workers=0?

For now you could use it as a workaround, as it's unclear what's causing the workers to crash.
In case you are using an older PyTorch version, I would recommend updating to the latest release. Also, to isolate the issue further, could you post an executable code snippet that reproduces it?

Updating to 1.9.0 fixed it. Thanks @ptrblck !