Troubleshooting intermittent input/output errors in DDP

I’m getting an intermittent input/output error while training a ResNet-50 classifier on ImageNet. I use DDP to distribute the workload over 4 GPUs. Some runs work fine; others fail, often in the first epoch, and I’m struggling to figure out why. For months I didn’t see any problems, but over the last few weeks almost half the runs have failed. In general, how does one approach troubleshooting this kind of issue?

The only recent change I can think of that might explain this sudden jump is that I increased the batch size quite a bit. I also sometimes start multiple runs in parallel, all of which read from the same dataset. Could this cause some form of “file lock” where different runs try to read the same image at the same time? That would explain why the issue is intermittent (some runs work fine) and why it has gotten worse lately. If not, I’d greatly appreciate any pointers.
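
One way to test the shared-read hypothesis in isolation is to read a single image from many readers at once and see whether any read fails. The sketch below is only an approximation: it uses threads inside one process rather than separate training runs, and the names `read_file`, `hammer_file`, `readers` and `rounds` are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def read_file(path):
    """Read the whole file; return (byte_count, None) or (None, error)."""
    try:
        with open(path, "rb") as f:
            return len(f.read()), None
    except OSError as exc:
        return None, exc


def hammer_file(path, readers=16, rounds=50):
    """Read one file from `readers` threads at once, `rounds` times.

    Returns the list of OSError instances raised by any read.
    """
    errors = []
    with ThreadPoolExecutor(max_workers=readers) as pool:
        for _ in range(rounds):
            for _size, exc in pool.map(read_file, [path] * readers):
                if exc is not None:
                    errors.append(exc)
    return errors
```

If `hammer_file("/path/to/file.jpeg")` returns an empty list even with many readers, plain concurrent reads are probably not the trigger on their own, and the storage layer under the dataset becomes the more likely suspect.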

Some things I have tried (unsuccessfully):

  • Reducing the number of workers in the DataLoader (though not to 0 due to performance)
  • Iterating through all images in my local copy of ImageNet and checking with img.verify() from the Pillow library that none of them are corrupted
  • Reducing the OMP_NUM_THREADS environment variable (though not all the way to 1 due to performance)
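
Since the traceback below fails inside open(), before Pillow decodes anything, a check at the byte level may be more telling than img.verify(), which as far as I understand does not decode the full image data. The following is a minimal sketch under that assumption: it reads every file under a directory to the end and records the errno name of any failure. `scan_tree`, `passes` and `chunk_size` are hypothetical names, not part of any library.

```python
import errno
import os


def scan_tree(root, passes=1, chunk_size=1 << 20):
    """Read every file under `root` to the end, `passes` times.

    Returns a list of (path, errno_name) for each read that raised OSError.
    """
    failures = []
    for _ in range(passes):
        for dirpath, _dirs, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    with open(path, "rb") as f:
                        # Pull the data through in chunks so the whole
                        # file is actually read from storage.
                        while f.read(chunk_size):
                            pass
                except OSError as exc:
                    code = errno.errorcode.get(exc.errno, str(exc.errno))
                    failures.append((path, code))
    return failures
```

Running several passes, ideally while a training run is active, would show whether the same files fail every time (pointing at bad blocks on disk) or different ones each time (pointing at a flaky mount, cable or controller).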

Error output

(The rank and the DataLoader worker process number change from one failed run to the next.)


[rank1]: OSError: Caught OSError in DataLoader worker process 0.
[rank1]: Original Traceback (most recent call last):
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
[rank1]:     data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
[rank1]:     data = [self.dataset[idx] for idx in possibly_batched_index]
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
[rank1]:     data = [self.dataset[idx] for idx in possibly_batched_index]
[rank1]:   File "/path/to/dataloader.py", line 85, in __getitem__
[rank1]:     x, y = self.subset[index]
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataset.py", line 411, in __getitem__
[rank1]:     return self.dataset[self.indices[idx]]
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torchvision/datasets/folder.py", line 245, in __getitem__
[rank1]:     sample = self.loader(path)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torchvision/datasets/folder.py", line 284, in default_loader
[rank1]:     return pil_loader(path)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torchvision/datasets/folder.py", line 262, in pil_loader
[rank1]:     with open(path, "rb") as f:
[rank1]: OSError: [Errno 5] Input/output error: '/path/to/file.jpeg'
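
Because the error is raised by the dataset's loader, one diagnostic option is to wrap that loader so each failure is logged with host, process id and errno, and retried a few times. This is a sketch of that idea, not a fix: a retry only hides the symptom, and `make_retrying_loader`, `retries` and `delay` are names invented for this example.

```python
import errno
import logging
import os
import socket
import time

log = logging.getLogger("io_retry")


def make_retrying_loader(loader, retries=3, delay=0.5):
    """Wrap `loader(path)` so OSErrors are logged and retried."""

    def load(path):
        for attempt in range(1, retries + 1):
            try:
                return loader(path)
            except OSError as exc:
                log.warning(
                    "host=%s pid=%d attempt=%d/%d errno=%s path=%s",
                    socket.gethostname(), os.getpid(), attempt, retries,
                    errno.errorcode.get(exc.errno, exc.errno), path,
                )
                if attempt == retries:
                    raise  # still failing: surface the original error
                time.sleep(delay * attempt)  # linear backoff between tries

    return load
```

With torchvision this could be plugged in through the `loader` argument of `ImageFolder`, for example `ImageFolder(root, loader=make_retrying_loader(default_loader))` with `default_loader` imported from `torchvision.datasets.folder`. If a read succeeds on the second attempt, the file itself is fine and the problem lies below Python, so the kernel log (`dmesg`) around the logged timestamps is the next place to look.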