RuntimeError: DataLoader worker (pid 228250) is killed by signal: Aborted

When I train an encoder-VQ-decoder model with fairseq (a seq2seq toolkit built by Meta), I hit a confusing but reproducible DataLoader bug. The training runs on the Azure AI Platform on a large-scale dataset: 6.3 TB, about 13 million pieces of data, which I first split into chunks of 1,000 items each (about 30,000 chunks in total).
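For context, the chunking was done roughly as in the sketch below. This is only an illustration of the preprocessing step, not my actual script; the manifest path, output naming, and file format are placeholders.

```python
# Hypothetical sketch of the pre-training chunking step: split one large
# manifest into files of 1000 items each. Paths and naming are illustrative,
# not the real preprocessing script.
import os

def _flush(out_dir: str, idx: int, lines: list) -> None:
    with open(os.path.join(out_dir, f"chunk_{idx:05d}.tsv"), "w") as out:
        out.writelines(lines)

def split_manifest(manifest_path: str, out_dir: str, chunk_size: int = 1000) -> int:
    """Write every `chunk_size` lines of the manifest into its own chunk file."""
    os.makedirs(out_dir, exist_ok=True)
    chunk_idx, buf = 0, []
    with open(manifest_path) as f:
        for line in f:
            buf.append(line)
            if len(buf) == chunk_size:
                _flush(out_dir, chunk_idx, buf)
                chunk_idx, buf = chunk_idx + 1, []
    if buf:  # last partial chunk
        _flush(out_dir, chunk_idx, buf)
        chunk_idx += 1
    return chunk_idx
```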

The detailed system information is listed in the public Dockerfile: Dockerfile
I trained on two nodes, each with 16 V100 (16 GB) GPUs.

Here is my detailed bug report, along with the methods I have tried. The script crashes at around 100,000 steps, possibly more. I also trained on a small dataset and hit the same bug after about 40 epochs, so I don't think it is a data problem.

/scratch/amlt_code/examples/encodec/modules/quantization/core_vq.py:313: UserWarning: When using RVQ in training model, first check https://github.com/facebookresearch/encodec/issues/25 . The bug wasn't fixed here for reproducibility.
  warnings.warn('When using RVQ in training model, first check '
/scratch/amlt_code/examples/encodec/criterion/audio_to_mel.py:24: UserWarning: Empty filters detected in mel frequency basis. Some channels will produce empty responses. Try increasing your sampling rate (and fmax) or reducing n_mels.
  mel_basis = librosa_mel_fn(sr=sampling_rate,n_fft=n_fft,n_mels=n_mel_channels,fmin=mel_fmin,fmax=mel_fmax)
/scratch/amlt_code/fairseq/tasks/fairseq_task.py:582: UserWarning: ntokens not found in Criterion logging outputs, cannot log wpb or wps
  warnings.warn(
/scratch/amlt_code/fairseq/tasks/fairseq_task.py:591: UserWarning: nsentences not found in Criterion logging outputs, cannot log bsz
  warnings.warn(
Traceback (most recent call last):
  File "/opt/conda/envs/fairseq/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1163, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/opt/conda/envs/fairseq/lib/python3.9/queue.py", line 180, in get
    self.not_empty.wait(remaining)
  File "/opt/conda/envs/fairseq/lib/python3.9/threading.py", line 316, in wait
    gotit = waiter.acquire(True, timeout)
  File "/opt/conda/envs/fairseq/lib/python3.9/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 228250) is killed by signal: Aborted. 

The above exception was the direct cause of the following exception (the warnings above can be ignored):

Traceback (most recent call last):
  File "/scratch/amlt_code/fairseq_cli/hydra_train.py", line 27, in hydra_main
    _hydra_main(cfg)
  File "/scratch/amlt_code/fairseq_cli/hydra_train.py", line 56, in _hydra_main
    distributed_utils.call_main(cfg, pre_main, **kwargs)
  File "/scratch/amlt_code/fairseq/distributed/utils.py", line 354, in call_main
    distributed_main(cfg.distributed_training.device_id, main, cfg, kwargs)
  File "/scratch/amlt_code/fairseq/distributed/utils.py", line 328, in distributed_main
    main(cfg, **kwargs)
  File "/scratch/amlt_code/fairseq_cli/train.py", line 190, in main
    valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
  File "/opt/conda/envs/fairseq/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/scratch/amlt_code/fairseq_cli/train.py", line 312, in train
    for i, samples in enumerate(progress):
  File "/scratch/amlt_code/fairseq/logging/progress_bar.py", line 202, in __iter__
    for i, obj in enumerate(self.iterable, start=self.n):
  File "/scratch/amlt_code/fairseq/data/iterators.py", line 57, in __next__
    x = next(self._itr)
  File "/scratch/amlt_code/fairseq/data/iterators.py", line 637, in _chunk_iterator
    for x in itr:
  File "/scratch/amlt_code/fairseq/data/iterators.py", line 57, in __next__
    x = next(self._itr)
  File "/opt/conda/envs/fairseq/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
    data = self._next_data()
  File "/opt/conda/envs/fairseq/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1359, in _next_data
    idx, data = self._get_data()
  File "/opt/conda/envs/fairseq/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1315, in _get_data
    success, data = self._try_get_data()
  File "/opt/conda/envs/fairseq/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1176, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 228250) exited unexpectedly

Here are the methods I have already tried:

  1. Increased the container's shared memory: the container shm is 350 GB, and I also monitored shm usage during training (see the monitoring sketch after this list), so I don't think shm exhaustion is the cause.
  2. Changed num_workers from 4 to 2 to 0; the problem occurs with every setting.
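Here is a minimal sketch (not my actual monitoring code) of how I watched /dev/shm alongside training; it simply polls the tmpfs mount so I could see whether shared memory was filling up before the crash:

```python
# Standalone /dev/shm monitor used to rule out shared-memory exhaustion.
# Independent of fairseq; it only polls disk usage of the tmpfs mount.
import shutil
import time

def log_shm_usage(interval_s: int = 60) -> None:
    """Print /dev/shm usage every `interval_s` seconds."""
    while True:
        usage = shutil.disk_usage("/dev/shm")
        print(
            f"/dev/shm: {usage.used / 1024**3:.1f} GB used "
            f"of {usage.total / 1024**3:.1f} GB"
        )
        time.sleep(interval_s)

if __name__ == "__main__":
    log_shm_usage()
```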

Could you please help me fix this bug, or suggest some debugging methods? One instrumentation idea I am considering is sketched after the issue list below. I have also checked the following PyTorch issues; I think a detailed, easy-to-understand bug report is necessary:

  1. Give a better error when we run out of shared memory, instead of "RuntimeError: DataLoader worker (pid 13) is killed by signal: Bus error."
  2. Multithreaded Generator Error when in Debug
  3. Useless Exception traces when DataSet timing out
  4. Training got stuck due to timeout from dataloader
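One debugging idea I am considering is enabling faulthandler inside every DataLoader worker, so that a fatal signal such as SIGABRT dumps a Python-level stack trace before the worker dies. The sketch below is self-contained and uses a dummy dataset, not my fairseq setup; in fairseq the DataLoader is created internally, so this is only meant to illustrate the mechanism:

```python
# Illustrative only: enable faulthandler in each DataLoader worker so that a
# fatal signal (e.g. SIGABRT) prints the worker's Python stack traces to stderr.
# The dataset below is a dummy placeholder, not the real fairseq dataset.
import faulthandler

import torch
from torch.utils.data import DataLoader, Dataset

class DummyDataset(Dataset):
    def __len__(self) -> int:
        return 1000

    def __getitem__(self, idx: int) -> torch.Tensor:
        return torch.zeros(16)

def worker_init_fn(worker_id: int) -> None:
    # faulthandler registers handlers for SIGSEGV, SIGABRT, SIGBUS, SIGFPE, SIGILL.
    faulthandler.enable(all_threads=True)

loader = DataLoader(
    DummyDataset(),
    batch_size=8,
    num_workers=2,
    worker_init_fn=worker_init_fn,
)

for batch in loader:
    pass
```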