Training crashes due to - Insufficient shared memory (shm) - nn.DataParallel

ptrblck · June 9, 2020, 4:34am

You would need to modify it in your system.
Are you using a manual multiprocessing (or multi-node) approach? I’m not sure why your model needs shared memory. Usually the DataLoader uses the shared memory to transfer the loaded data and I’m unsure why you are seeing this issue when switching models.

akashs · June 9, 2020, 6:01pm

I was using the pytorch-lightning framework in DistributedDataParallel (ddp) training mode.

akashs · June 10, 2020, 2:02am

@ptrblck If I reduce num_workers from 2 to 0, training works properly, but the training time increases significantly.
Here is the detailed error log:

RuntimeError: DataLoader worker (pid 24459) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 389, in ddp_train
    self.run_pretrain_routine(model)
  File "/home/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1015, in run_pretrain_routine
    self.train()
  File "/home/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 347, in train
    self.run_training_epoch()
  File "/home/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 451, in run_training_epoch
    self.run_evaluation(test_mode=self.testing)
  File "/home/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 377, in run_evaluation
    eval_results = self._evaluate(self.model, dataloaders, max_batches, test_mode)
  File "/home/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 256, in _evaluate
    for batch_idx, batch in enumerate(dataloader):
  File "/home/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 841, in _next_data
    idx, data = self._get_data()
  File "/home/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 808, in _get_data
    success, data = self._try_get_data()
  File "/home/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 774, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 24459) exited unexpectedly

ptrblck · June 10, 2020, 5:33am

This error message does indeed point to the shared memory limit for multiprocessing, so you would need to increase it.

rui_zhang_331 · June 16, 2020, 4:24am

Hi, I got the same error:

ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
  File "/home/miniconda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 480, in _try_get_batch
    data = self.data_queue.get(timeout=timeout)
  File "/home/miniconda/lib/python3.6/multiprocessing/queues.py", line 104, in get
    if not self._poll(timeout):
  File "/home/miniconda/lib/python3.6/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/home/miniconda/lib/python3.6/multiprocessing/connection.py", line 414, in _poll
    r = wait([self], timeout)
  File "/home/miniconda/lib/python3.6/multiprocessing/connection.py", line 911, in wait
    ready = selector.select(timeout)
  File "/home/miniconda/lib/python3.6/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)
  File "/home/miniconda/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 65, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 95106) is killed by signal: Segmentation fault.

What is interesting is that when I just read the data line by line, I do not have this issue:

        with open(current_file, mode='rb') as f:
            text = f.read().decode('utf-8')
            all_data.extend(text.split('\n'))

But if I add JSON parsing after reading the lines, it reports this error:

        with open(current_file, mode='rb') as f:
            text = f.read().decode('utf-8')
            all_data.extend(text.split('\n'))

        json_data = []
        for line in all_data:
            try:
                json_data.append(json.loads(line))
            except:
                break

        return json_data

Any clue why this happens once JSON parsing is added? I think it is a pretty standard process, not specific to PyTorch.

I searched around and everything points to shared memory. However, I am not fully convinced that my shm is the problem. I checked the shm setting in Docker and it is:

kernel.shmmax = 18446744073692774399

Also, in my local test, even with a very small data sample (100 lines of JSON) and 2 workers, the problem still occurs, so I wonder what the issue could be.
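
One thing worth noting about the parsing snippet above, independent of the crash: the bare except followed by break stops at the first line that fails to parse and hides the reason. Since text.split('\n') yields an empty string for a blank or trailing line, parsing can end early without any message. The following is a minimal sketch, not a confirmed fix for the segmentation fault; parse_json_lines is a hypothetical helper name. It skips blank lines and records which lines failed.

```python
import json


def parse_json_lines(text):
    """Parse newline-delimited JSON and return (records, errors)."""
    records, errors = [], []
    for lineno, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            # Blank or trailing lines are skipped instead of ending the loop.
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            # Keep the line number and message so failures stay visible.
            errors.append((lineno, str(exc)))
    return records, errors
```

Inspecting the returned errors list on the small 100-line sample may help rule out malformed input before looking further at shared memory.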

maqy · September 14, 2020, 9:33am

I hit the same issue in PyTorch 1.5.1, and setting a new value in /etc/sysctl.conf did not help (the default value of kernel.shmmax, 18446744073692774399, is already large enough).

This time, I ran df -h and found a filesystem mounted at /dev/shm (shm appears to mean shared memory) whose size is 50% of the machine’s memory. I then remounted it with:

mount -o size=yourMemorySize -o nr_inodes=1000000 -o noatime,nodiratime -o remount /dev/shm

This fixed the problem.

By the way, the OS I used is CentOS 7.
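
For readers who want to check the same numbers from Python rather than with df -h, here is a minimal sketch using the standard-library shutil.disk_usage. The helper name shm_usage and its default path are assumptions for illustration; /dev/shm is the usual tmpfs mount on Linux, but it may be absent on other systems.

```python
import shutil


def shm_usage(path="/dev/shm"):
    """Return (total, used, free) in GiB for the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    gib = 1024 ** 3
    return usage.total / gib, usage.used / gib, usage.free / gib
```

Calling shm_usage() before training and again while the DataLoader workers are running gives a rough view of how close /dev/shm is to its limit.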

Brando_Miranda · October 14, 2020, 6:15pm

Hi @ptrblck ,

Is it possible to get a more precise description of what causes the error, to ease debugging? Is it caused by:

my code
the RAM
the GPU memory
filesystem memory/storage

I find that if it’s a recurring issue, it’s often useful to state the source of the problem so that it can be fixed more effectively.

github.com/pytorch/pytorch

Give a better error when we run out of shared memory, instead of "RuntimeError: DataLoader worker (pid 13) is killed by signal: Bus error."

opened 05:03AM - 05 Feb 18 UTC

closed 12:27PM - 14 Sep 19 UTC

miraclewkf

high priority module: docs module: dataloader triaged

When I set `num_workers=1` or other value greater than 0 in `torch.utils.data.Da…taLoader`, I get this error. The detail of the error:
```
Traceback (most recent call last):
  File "/opt/project/train.py", line 150, in <module>
    dataset_sizes=dataset_sizes)
  File "/opt/project/train.py", line 51, in train_model
    outputs = model(inputs)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 64, in forward
    inputs, kwargs = self.scatter(inputs, kwargs, self.device_ids)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 75, in scatter
    return scatter_kwargs(inputs, kwargs, device_ids, dim=self.dim)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 30, in scatter_kwargs
    inputs = scatter(inputs, target_gpus, dim) if inputs else []
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 25, in scatter
    return scatter_map(inputs)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 18, in scatter_map
    return list(zip(*map(scatter_map, obj)))
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 15, in scatter_map
    return Scatter.apply(target_gpus, None, dim, obj)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 68, in forward
    outputs = comm.scatter(input, ctx.target_gpus, ctx.chunk_sizes, ctx.dim, streams)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/cuda/comm.py", line 189, in scatter
    outputs.append(chunk.cuda(device, async=True))
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/_utils.py", line 69, in _cuda
    return new_type(self.size()).copy_(self, async)
  File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 172, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 13) is killed by signal: Bus error.
```

wenm12222 · March 5, 2021, 7:45am

This solved my problem! Thank you

FruitVinegar · April 13, 2021, 11:04am

Hi. I’m suffering the same issue…

Platform info:

ubuntu 18.04
pytorch 1.8.1
cuda version 11.0
nvidia driver version 450.102.04
NO DOCKER; I use a conda environment and am (luckily) the only user of this server.

System info:

Physical memory: 256GB
Shared memory: 1/2 of physical memory ~= 126GB
GPU: TITAN V(VRAM: 12GB) * 8ea

Training info:

Model: Quantization-aware training applied(library: brevitas) ResNet18
Dataset: I tried CIFAR10, CIFAR100, and ImageNet; no error occurred with CIFAR10/100, so it may be related to the large dataset.
Distribution: Pytorch DDP

Symptoms:
Suddenly, one GPU is released from multiprocessing (its memory usage goes to 0 and its dedicated process disappears), while the other GPUs stay locked on a semaphore at 100% GPU utilization.

I tried num_workers=8, 32, and 64, and got the error with 32 and 64.
I know that more workers consume more shared memory, but in my case they consume <1% of shared memory (I monitored it with df -h).

num_workers=64 is 3.5x faster than num_workers=8, so I don’t want to give up this option.
Are there any known bugs related to num_workers?

Joslefaure · July 27, 2021, 12:23am

Hi @FruitVinegar, have you found a solution for this problem? I have the same setup as you and the same problem.

FruitVinegar · August 12, 2021, 11:34am

I’m not sure, but in my case there was a faulty DRAM slot and a bad-sector problem on the HDD; after fixing these two hardware issues, the symptom went away…

LeungWaiHo · September 29, 2021, 12:19am

Do you have a solution to this problem?
I have the same problem and I also do not use Docker.
I am confused about why it still occurs when I am not using a Docker container.

neuronflow · October 4, 2021, 10:02am

Same here, newest PyTorch, Ubuntu 20, no Docker involved:

kernel.shmmax = 18446744073692774399

I am trying to train in parallel on 2 GPUs.

Louis1 · October 18, 2021, 9:24am

Same here, after a few epochs the training crashes because there is not enough shared memory.
I train on 2 GPUs with a dataloader with 32 workers.
No Docker involved, 94 GB of /dev/shm, which seems like a lot.

juanigp · October 28, 2021, 7:33am

Same problem. Training on HPC, single-node sbatch job (no Docker), 4 GPUs, crash during the second epoch. I don’t understand why it crashes during the second epoch but not the first.

son · April 8, 2022, 3:11am

I also encountered the same issue.
There is no problem when using 2 GPUs, but problems occur with 3 or more.
The issue was resolved by adding the --ipc=host option when creating the container.

But I don’t understand how this problem relates to the shared memory of the container.
Could you please explain why we need to increase the shared memory size?

ptrblck · April 8, 2022, 5:36am

I would guess that nn.DataParallel might need additional shared memory, but I am not sure, as we generally recommend using DistributedDataParallel for better performance.
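
As background for the question above: earlier in this thread it was noted that the DataLoader uses shared memory to transfer loaded data from its worker processes. On Linux, POSIX shared memory segments are typically backed by the /dev/shm tmpfs, and Docker containers get a small /dev/shm by default (64 MB) unless they are started with options such as --shm-size or --ipc=host. The sketch below uses only Python's standard multiprocessing.shared_memory module (Python 3.8+) to illustrate the mechanism. It is not PyTorch's actual implementation, and roundtrip_through_shm is a hypothetical name.

```python
from multiprocessing import shared_memory


def roundtrip_through_shm(payload: bytes) -> bytes:
    """Write payload into a shared memory block and read it back via a second handle."""
    writer = shared_memory.SharedMemory(create=True, size=max(len(payload), 1))
    try:
        writer.buf[:len(payload)] = payload
        # A second handle attached by name stands in for another process.
        reader = shared_memory.SharedMemory(name=writer.name)
        try:
            data = bytes(reader.buf[:len(payload)])
        finally:
            reader.close()
    finally:
        writer.close()
        writer.unlink()  # Free the segment so it no longer occupies shared memory.
    return data
```

Until a segment is unlinked it keeps occupying shared memory, which is why a small limit inside a container can be exhausted even when the host has plenty of RAM.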

luojing · November 2, 2022, 2:09am

Hello, have you solved this problem?

luojing · November 2, 2022, 2:15am

I ran the same code on other servers without this error.

ZenoB · January 5, 2023, 12:55pm

I have the same error in a conda environment. kernel.shmmax=18446744073692774399, and if I watch df -h I can see /dev/shm saturate before the code breaks. The pseudo-code that generates the error is:

dl = DataLoader(X, batch_size=batch_size, num_workers=16)
tmp = []
for i, batch in enumerate(dl):
    tmp.append(batch.numpy())
    [...]

If I change tmp to a preallocated numpy array and fill it iteratively with the current batch, the shared memory does not saturate. The working pseudo-code is:

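import numpy as np
from torch.utils.data import DataLoader

dl = DataLoader(X, batch_size=batch_size, num_workers=16)
# Preallocate the output; assigning into it copies each batch's values
tmp = np.empty([desired_num_rows, X.shape[1]], dtype=np.float32)
for i, batch in enumerate(dl):
    # Offset by the nominal batch size so a smaller final batch lands in the right rows
    t_from = i * batch_size
    t_to = t_from + batch.shape[0]
    tmp[t_from:t_to, :] = batch.numpy()
    [...]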

Alternatively, this suggestion also works.
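
A plausible explanation for the difference, offered as an assumption rather than something confirmed in this thread: batch.numpy() returns an array that shares memory with the tensor, so keeping every such array in a list keeps every batch's underlying buffer alive, whereas assigning into a preallocated array copies the values and lets each batch buffer be released. The NumPy-only sketch below illustrates just the view-versus-copy distinction; the array named batch_buffer stands in for one batch and is hypothetical.

```python
import numpy as np

# Stand-in for the memory behind one batch (hypothetical).
batch_buffer = np.arange(6, dtype=np.float32).reshape(2, 3)

# A view, like the result of batch.numpy(): it shares memory with the source.
view = batch_buffer[:]

# Copy into a preallocated array: values are duplicated, no memory is shared.
out = np.empty((2, 3), dtype=np.float32)
out[:] = batch_buffer

shares_view = np.shares_memory(view, batch_buffer)
shares_copy = np.shares_memory(out, batch_buffer)
```

If this is what happens here, appending batch.numpy().copy() to the list should behave like the preallocated version, at the cost of one extra copy per batch.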