ImageNet training error after 14 epochs

Rio1210 · January 11, 2021, 1:52pm

I am training ResNet-50 on ImageNet using the script provided by PyTorch (with a trivial tweak for my purpose). However, I am getting the following error after 14 epochs of training. I have allocated 4 GPUs on a server to run this. Any pointers as to what this error is about would be appreciated. Thanks a lot!

Epoch: [14][5000/5005]	Time 1.910 (2.018)	Data 0.000 (0.191)	Loss 2.6954 (2.7783)	Total 2.6954 (2.7783)	Reg 0.0000	Prec@1 42.969 (40.556)	Prec@5 64.844 (65.368)	 
Test: [0/196]	Time 86.722 (86.722)	Loss 1.9551 (1.9551)	Prec@1 51.562 (51.562)	Prec@5 81.641 (81.641)
Traceback (most recent call last):
  File "main_group.py", line 549, in <module>
  File "main_group.py", line 256, in main
    
  File "main_group.py", line 466, in validate
    if args.gpu is not None:
  File "/home/users/rio/anaconda3/envs/ml/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 801, in __next__
    return self._process_data(data)
  File "/home/users/rio/anaconda3/envs/ml/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 846, in _process_data
    data.reraise()
  File "/home/users/rio/anaconda3/envs/ml/lib/python3.7/site-packages/torch/_utils.py", line 385, in reraise
    raise self.exc_type(msg)
OSError: Caught OSError in DataLoader worker process 11.
Original Traceback (most recent call last):
  File "/home/users/rio/anaconda3/envs/ml/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/users/rio/anaconda3/envs/ml/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/users/rio/anaconda3/envs/ml/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/users/rio/anaconda3/envs/ml/lib/python3.7/site-packages/torchvision/datasets/folder.py", line 138, in __getitem__
    sample = self.loader(path)
  File "/home/users/rio/anaconda3/envs/ml/lib/python3.7/site-packages/torchvision/datasets/folder.py", line 174, in default_loader
    return pil_loader(path)
  File "/home/users/rio/anaconda3/envs/ml/lib/python3.7/site-packages/torchvision/datasets/folder.py", line 155, in pil_loader
    with open(path, 'rb') as f:
OSError: [Errno 5] Input/output error: '/data/users2/rio/github/imagenet-data/val/n02102973/ILSVRC2012_val_00009130.JPEG'

ptrblck · January 19, 2021, 9:37am

This error points to a failure in reading the ILSVRC2012_val_00009130.JPEG file. If you’ve mounted the /data folder from a remote server, it can be raised, for example, when the connection drops or times out.
If that’s the case, could you create a local copy of the dataset and rerun the training?
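To see whether the failure is limited to this one file, a minimal sketch like the one below could walk the dataset folder and try to read every file in full, collecting the ones that raise an OSError. The function name `find_unreadable_files` and the `chunk_size` parameter are hypothetical and not part of the PyTorch script. The check only tests that the bytes can be read; it does not verify that they decode as a valid image.

```python
import os


def find_unreadable_files(root, chunk_size=1 << 20):
    """Return (path, error) pairs for files under `root` that cannot be read.

    Hypothetical helper: it reads every file to the end in binary mode, so
    I/O errors (e.g. errno 5 on a flaky network mount) show up here instead
    of inside a DataLoader worker.
    """
    failures = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    # Read in chunks so large files do not fill memory.
                    while f.read(chunk_size):
                        pass
            except OSError as exc:
                failures.append((path, exc))
    return failures
```

Calling `find_unreadable_files("/data/users2/rio/github/imagenet-data/val")` and printing the result should list any files that fail. If the list changes between runs, that would suggest an unstable mount rather than a damaged file.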

Rio1210 · January 26, 2021, 3:12am

Thank you @ptrblck. I figured out the issue. Apparently the official PyTorch script was reading (or trying to read) the .tar files that were still in the train and val folders. Removing the .tar files from those folders solved the issue!

Posting it in case someone else might face similar issues!
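
For anyone who wants to check their own copy, here is a rough sketch that lists files in the dataset folders whose extension is not an image extension, such as leftover .tar archives. The name `find_stray_files` and the `IMAGE_EXTENSIONS` set are my own assumptions for illustration, not something taken from the PyTorch script, so adjust the set to match your data.

```python
import os

# Assumption: only these extensions count as images; extend as needed.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def find_stray_files(root, allowed=IMAGE_EXTENSIONS):
    """Return a sorted list of files under `root` that are not images by extension."""
    stray = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Compare case-insensitively, since ImageNet files end in ".JPEG".
            ext = os.path.splitext(name)[1].lower()
            if ext not in allowed:
                stray.append(os.path.join(dirpath, name))
    return sorted(stray)
```

Running it once on the train folder and once on the val folder prints nothing to move if both lists come back empty.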