When I train an encoder-VQ-decoder model with fairseq (a seq2seq toolkit built by Meta), I hit a confusing but reproducible DataLoader bug. It occurs when training on a large-scale dataset (6.3 TB, 13 million samples; I first split them into chunks of 1000 samples each, giving 30,000 chunks) on the Azure AI Platform.
The detailed system information is listed in the public Dockerfile: Dockerfile
I used two nodes, each with 16 V100 (16 GB) GPUs.
Here is my detailed bug report, along with the methods I have tried (the script crashes at roughly 100,000 steps, or sometimes later). I also trained on a small dataset and hit the same bug after 40 epochs, so I don't think it is a data problem.
/scratch/amlt_code/examples/encodec/modules/quantization/core_vq.py:313: UserWarning: When using RVQ in training model, first check https://github.com/facebookresearch/encodec/issues/25 . The bug wasn't fixed here for reproducibility.
warnings.warn('When using RVQ in training model, first check '
/scratch/amlt_code/examples/encodec/criterion/audio_to_mel.py:24: UserWarning: Empty filters detected in mel frequency basis. Some channels will produce empty responses. Try increasing your sampling rate (and fmax) or reducing n_mels.
mel_basis = librosa_mel_fn(sr=sampling_rate,n_fft=n_fft,n_mels=n_mel_channels,fmin=mel_fmin,fmax=mel_fmax)
/scratch/amlt_code/fairseq/tasks/fairseq_task.py:582: UserWarning: ntokens not found in Criterion logging outputs, cannot log wpb or wps
warnings.warn(
/scratch/amlt_code/fairseq/tasks/fairseq_task.py:591: UserWarning: nsentences not found in Criterion logging outputs, cannot log bsz
warnings.warn(
Traceback (most recent call last):
File "/opt/conda/envs/fairseq/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1163, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/opt/conda/envs/fairseq/lib/python3.9/queue.py", line 180, in get
self.not_empty.wait(remaining)
File "/opt/conda/envs/fairseq/lib/python3.9/threading.py", line 316, in wait
gotit = waiter.acquire(True, timeout)
File "/opt/conda/envs/fairseq/lib/python3.9/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 228250) is killed by signal: Aborted.
(I think the warnings above are unrelated.)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/scratch/amlt_code/fairseq_cli/hydra_train.py", line 27, in hydra_main
_hydra_main(cfg)
File "/scratch/amlt_code/fairseq_cli/hydra_train.py", line 56, in _hydra_main
distributed_utils.call_main(cfg, pre_main, **kwargs)
File "/scratch/amlt_code/fairseq/distributed/utils.py", line 354, in call_main
distributed_main(cfg.distributed_training.device_id, main, cfg, kwargs)
File "/scratch/amlt_code/fairseq/distributed/utils.py", line 328, in distributed_main
main(cfg, **kwargs)
File "/scratch/amlt_code/fairseq_cli/train.py", line 190, in main
valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
File "/opt/conda/envs/fairseq/lib/python3.9/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/scratch/amlt_code/fairseq_cli/train.py", line 312, in train
for i, samples in enumerate(progress):
File "/scratch/amlt_code/fairseq/logging/progress_bar.py", line 202, in __iter__
for i, obj in enumerate(self.iterable, start=self.n):
File "/scratch/amlt_code/fairseq/data/iterators.py", line 57, in __next__
x = next(self._itr)
File "/scratch/amlt_code/fairseq/data/iterators.py", line 637, in _chunk_iterator
for x in itr:
File "/scratch/amlt_code/fairseq/data/iterators.py", line 57, in __next__
x = next(self._itr)
File "/opt/conda/envs/fairseq/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 681, in __next__
data = self._next_data()
File "/opt/conda/envs/fairseq/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1359, in _next_data
idx, data = self._get_data()
File "/opt/conda/envs/fairseq/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1315, in _get_data
success, data = self._try_get_data()
File "/opt/conda/envs/fairseq/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1176, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 228250) exited unexpectedly
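As a debugging step, I plan to iterate the dataset outside fairseq's trainer with a bare DataLoader and faulthandler enabled in each worker, so the crash can be tied to a concrete sample index. This is only a minimal sketch: `MyChunkedDataset` and its path are placeholders for my real dataset class, not actual fairseq code.

```python
# Hypothetical sketch: reproduce the worker crash outside fairseq and log
# which index each worker is fetching when it dies.
import faulthandler

import torch
from torch.utils.data import DataLoader, Dataset


class IndexLoggingDataset(Dataset):
    """Wraps the real dataset and records which index each worker fetches."""

    def __init__(self, base: Dataset):
        self.base = base

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        info = torch.utils.data.get_worker_info()
        worker = info.id if info is not None else -1
        print(f"worker {worker} fetching index {idx}", flush=True)
        return self.base[idx]


def worker_init_fn(worker_id):
    # Dump a traceback if the worker process aborts (SIGABRT, SIGSEGV, ...).
    faulthandler.enable(all_threads=True)


if __name__ == "__main__":
    base = MyChunkedDataset("/path/to/chunks")  # placeholder for my real dataset
    loader = DataLoader(
        IndexLoggingDataset(base),
        batch_size=8,
        num_workers=2,
        worker_init_fn=worker_init_fn,
    )
    for step, batch in enumerate(loader):
        if step % 1000 == 0:
            print(f"step {step} ok", flush=True)
```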
Here are the methods I have already tried:
- Increase the container's shared memory: my container shm is 350 GB. I also monitored how shm usage changed during training (see the small sketch after this list), so I don't think this is the cause.
- Change num_workers from 4 to 2 to 0 (the problem occurs in every case).
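For the shm check mentioned in the first item above, this is roughly what I ran alongside training (a small sketch; the 30-second interval and log format are arbitrary):

```python
# Small sketch of how I watched /dev/shm while training ran.
import shutil
import time

while True:
    usage = shutil.disk_usage("/dev/shm")
    used_gb = (usage.total - usage.free) / 1024 ** 3
    total_gb = usage.total / 1024 ** 3
    print(f"/dev/shm used: {used_gb:.1f} GB / {total_gb:.1f} GB", flush=True)
    time.sleep(30)
```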
Could you please help me fix this bug or suggest some debugging methods? I have also checked the following issues in PyTorch. I think a detailed, easy-to-understand bug report is necessary.