I am training an encoder-VQ-decoder model with fairseq (a seq2seq toolkit built by Meta), and I keep hitting a confusing but reproducible DataLoader bug. It happens when I train on a large-scale dataset on the Azure AI Platform. The dataset is 6.3T with 13 million samples, which I first split into chunks of 1,000 samples each. There are 30,000 chunks.

The detailed system information is listed in the public Dockerfile: Dockerfile
I am using two nodes with 16 V100 GPUs (16 GB each).

Here is my detailed bug report, followed by the fixes I have tried. The script crashes at around 100,000 steps, possibly more. I also trained on a small dataset and hit the same bug at 40 epochs, so I don't think it is a data problem.

Traceback (most recent call last):
  File "/opt/conda/envs/fairseq/lib/python3.9/site-packages/torch/utils/data/", line 1163, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/opt/conda/envs/fairseq/lib/python3.9/", line 180, in get
  File "/opt/conda/envs/fairseq/lib/python3.9/", line 316, in wait
    gotit = waiter.acquire(True, timeout)
  File "/opt/conda/envs/fairseq/lib/python3.9/site-packages/torch/utils/data/_utils/", line 66, in handler
RuntimeError: DataLoader worker (pid 228250) is killed by signal: Aborted. 

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/scratch/amlt_code/fairseq_cli/", line 27, in hydra_main
  File "/scratch/amlt_code/fairseq_cli/", line 56, in _hydra_main
    distributed_utils.call_main(cfg, pre_main, **kwargs)
  File "/scratch/amlt_code/fairseq/distributed/", line 354, in call_main
    distributed_main(cfg.distributed_training.device_id, main, cfg, kwargs)
  File "/scratch/amlt_code/fairseq/distributed/", line 328, in distributed_main
    main(cfg, **kwargs)
  File "/scratch/amlt_code/fairseq_cli/", line 190, in main
    valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
  File "/opt/conda/envs/fairseq/lib/python3.9/", line 79, in inner
    return func(*args, **kwds)
  File "/scratch/amlt_code/fairseq_cli/", line 312, in train
    for i, samples in enumerate(progress):
  File "/scratch/amlt_code/fairseq/logging/", line 202, in __iter__
    for i, obj in enumerate(self.iterable, start=self.n):
  File "/scratch/amlt_code/fairseq/data/", line 57, in __next__
    x = next(self._itr)
  File "/scratch/amlt_code/fairseq/data/", line 637, in _chunk_iterator
    for x in itr:
  File "/scratch/amlt_code/fairseq/data/", line 57, in __next__
    x = next(self._itr)
  File "/opt/conda/envs/fairseq/lib/python3.9/site-packages/torch/utils/data/", line 681, in __next__
    data = self._next_data()
  File "/opt/conda/envs/fairseq/lib/python3.9/site-packages/torch/utils/data/", line 1359, in _next_data
    idx, data = self._get_data()
  File "/opt/conda/envs/fairseq/lib/python3.9/site-packages/torch/utils/data/", line 1315, in _get_data
    success, data = self._try_get_data()
  File "/opt/conda/envs/fairseq/lib/python3.9/site-packages/torch/utils/data/", line 1176, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 228250) exited unexpectedly

Here is what I have tried so far:

  1. Increased the container's shared memory (shm) to 350G. I also monitored how shm usage changes during training, and I don't think this is the cause.
  2. Changed num_workers: 4 -> 2 -> 0 (the problem occurs every time).
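
The shm check from item 1 can also be logged from inside the training process. This is a minimal sketch, assuming the container mounts shared memory at `/dev/shm` (the usual Linux location, not confirmed by the Dockerfile); the function names `shm_usage_gib` and `format_shm_usage` are made up for illustration.

```python
import shutil


def shm_usage_gib(path="/dev/shm"):
    """Return total, used and free space of a mount point in GiB."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return {
        "total": usage.total / gib,
        "used": usage.used / gib,
        "free": usage.free / gib,
    }


def format_shm_usage(stats):
    """Format the stats dictionary as a single log line."""
    return "shm total={total:.1f}GiB used={used:.1f}GiB free={free:.1f}GiB".format(
        **stats
    )


if __name__ == "__main__":
    # Assumes /dev/shm exists, which is the case on most Linux containers.
    print(format_shm_usage(shm_usage_gib()))
```

Calling this every few thousand steps would show whether shm usage grows before the crash or stays flat.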

Could you please help me fix this bug or suggest a way to debug it? I have also checked the following issues in PyTorch. I think a detailed, easy-to-understand bug report is necessary.

