Training crashes due to - Insufficient shared memory (shm) - nn.DataParallel

You would need to modify it in your system.
Are you using a manual multiprocessing (or multi-node) approach? I’m not sure why your model needs shared memory. Usually the DataLoader uses the shared memory to transfer the loaded data and I’m unsure why you are seeing this issue when switching models.

I was using the pytorch-lightning framework in DistributedDataParallel (ddp) training mode.

@ptrblck If I reduce num_workers from 2 to 0, training works properly, but the training time increases significantly. Here is the detailed error log:

RuntimeError: DataLoader worker (pid 24459) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 389, in ddp_train
    self.run_pretrain_routine(model)
  File "/home/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1015, in run_pretrain_routine
    self.train()
  File "/home/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 347, in train
    self.run_training_epoch()
  File "/home/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 451, in run_training_epoch
    self.run_evaluation(test_mode=self.testing)
  File "/home/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 377, in run_evaluation
    eval_results = self._evaluate(self.model, dataloaders, max_batches, test_mode)
  File "/home/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 256, in _evaluate
    for batch_idx, batch in enumerate(dataloader):
  File "/home/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 841, in _next_data
    idx, data = self._get_data()
  File "/home/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 808, in _get_data
    success, data = self._try_get_data()
  File "/home/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 774, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 24459) exited unexpectedly

This error message does indeed point to the shared memory limit for multiprocessing, so you would need to increase it.
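Before raising the limit, it can help to confirm how large the shared-memory filesystem actually is. A minimal sketch, assuming a Linux host where shared memory is backed by a tmpfs mounted at /dev/shm; the helper name `shm_usage` is hypothetical and only the standard library is used:

```python
import os
import shutil

def shm_usage(path="/dev/shm"):
    """Return total/used/free bytes of the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return {"total": usage.total, "used": usage.used, "free": usage.free}

# Only report when /dev/shm exists (assumption: Linux-style layout).
if os.path.isdir("/dev/shm"):
    info = shm_usage()
    gib = 1024 ** 3
    print(f"/dev/shm: {info['used'] / gib:.2f} GiB used of {info['total'] / gib:.2f} GiB")
```

Running this inside the same container or job that launches training shows the limit the DataLoader workers actually see, which may differ from the host's.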

Hi, I got the same error:

ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
  File "/home/miniconda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 480, in _try_get_batch
    data = self.data_queue.get(timeout=timeout)
  File "/home/miniconda/lib/python3.6/multiprocessing/queues.py", line 104, in get
    if not self._poll(timeout):
  File "/home/miniconda/lib/python3.6/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/home/miniconda/lib/python3.6/multiprocessing/connection.py", line 414, in _poll
    r = wait([self], timeout)
  File "/home/miniconda/lib/python3.6/multiprocessing/connection.py", line 911, in wait
    ready = selector.select(timeout)
  File "/home/miniconda/lib/python3.6/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)
  File "/home/miniconda/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 65, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 95106) is killed by signal: Segmentation fault.

What is interesting is that when I just read the data line by line, I do not have this issue:

        with open(current_file, mode='rb') as f:
            text = f.read().decode('utf-8')
            all_data.extend(text.split('\n'))

But if I add JSON parsing logic after reading line by line, it reports this error:

        with open(current_file, mode='rb') as f:
            text = f.read().decode('utf-8')
            all_data.extend(text.split('\n'))

        json_data = []
        for line in all_data:
            try:
                json_data.append(json.loads(line))
            except:
                break

        return json_data

Any clue why this happens once JSON parsing is added? I think it is a pretty standard process, not specific to PyTorch.
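Independently of the crash, note that the loop above stops silently at the first line that fails to parse, which includes a blank line. A sketch of a stricter variant that skips blank lines and records which lines failed, so malformed input is visible instead of truncating the dataset. `parse_json_lines` is a hypothetical helper, and this is not claimed to fix the segmentation fault:

```python
import json

def parse_json_lines(text):
    """Parse newline-delimited JSON; return (records, bad_line_numbers)."""
    records, bad_lines = [], []
    for lineno, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # skip blank lines, e.g. the empty string after a trailing newline
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            bad_lines.append(lineno)  # remember the 1-based line number and keep going
    return records, bad_lines
```

Checking whether `bad_lines` is empty on the 100-line sample would at least rule out malformed input as a factor.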

I searched around and everything points to shared memory. However, I am not fully convinced that my shm is the problem. I checked the shm setting in my Docker container and it is:

kernel.shmmax = 18446744073692774399

Also, in my local test, even with a very small data sample (100 lines of JSON) and 2 workers, it still has the problem, so I wonder what the issue could be.

I met the same issue in PyTorch 1.5.1, and setting a new value in /etc/sysctl.conf did not work (the default value of kernel.shmmax, 18446744073692774399, is already large enough).

This time, I used the df -h command and found a filesystem mounted at /dev/shm (shm seems to stand for shared memory), whose size is 50% of the machine’s memory. Then I remounted it with:

mount -o size=yourMemorySize -o nr_inodes=1000000 -o noatime,nodiratime -o remount /dev/shm

This fixed the problem.

By the way, the OS I used is CentOS 7.

Hi @ptrblck ,

Is it possible to have a more precise description of what is causing the error, to ease debugging? Is it caused by:

  1. my code
  2. the RAM
  3. the GPU memory
  4. filesystem memory/storage

I find that if it’s a recurring issue it’s often useful to state the source of the problem to more effectively fix it.
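One way to narrow this down is to log host RAM and shared-memory usage side by side while training runs. A minimal sketch, assuming a Linux host that exposes /proc/meminfo with the MemTotal, MemAvailable, and Shmem fields; the function names `parse_meminfo` and `read_memory_snapshot` are hypothetical:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integers (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values

def read_memory_snapshot(path="/proc/meminfo"):
    """Return the three fields most relevant here; missing fields map to None."""
    with open(path) as f:
        info = parse_meminfo(f.read())
    return {k: info.get(k) for k in ("MemTotal", "MemAvailable", "Shmem")}
```

If Shmem climbs toward the size of /dev/shm shortly before a crash, the shared-memory filesystem is the likely culprit; if MemAvailable collapses instead, host RAM is the more likely one.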


This solved my problem! Thank you

Hi. I’m suffering the same issue…

Platform info:

  • ubuntu 18.04
  • pytorch 1.8.1
  • cuda version 11.0
  • nvidia driver version 450.102.04
  • NO DOCKER, I use conda environment and (luckily) the only user of this server.

System info:

  • Physical memory: 256GB
  • Shared memory: 1/2 of physical memory ~= 126GB
  • GPU: 8 x TITAN V (12GB VRAM each)

Training info:

  • Model: ResNet18 with quantization-aware training applied (library: brevitas)
  • Dataset: I tried CIFAR10, CIFAR100, and ImageNet, but no error occurred with CIFAR10/100. Maybe it is related to the large dataset.
  • Distribution: Pytorch DDP

Symptoms:
Suddenly one GPU is released from multiprocessing (its memory usage goes to 0 and its dedicated process disappears), while the other GPUs stay locked on a semaphore at 100% GPU utilization.

I tried num_workers=8, 32, and 64, and got the error with 32 and 64.
I know that more workers consume more shared memory, but in my case it only consumes <1% of shared memory (I monitored it with df -h).

num_workers=64 is 3.5x faster than num_workers=8, so I don’t want to give up this option.
Are there any known bugs related to num_workers?
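For a rough sense of scale, the batch data that workers may hold in flight can be bounded by num_workers × prefetch_factor × bytes per batch, per DataLoader. The sketch below is back-of-envelope arithmetic only. The batch size (256) and sample shape (3×224×224 float32) are assumed values, not figures from the post, and prefetch_factor is assumed to be 2, the DataLoader default in recent PyTorch versions:

```python
def inflight_batch_bytes(num_workers, prefetch_factor, batch_size, sample_bytes):
    """Upper bound on bytes of prefetched batches held by one DataLoader's workers."""
    return num_workers * prefetch_factor * batch_size * sample_bytes

# Assumed sample: 3 x 224 x 224 float32 image (4 bytes per element).
sample_bytes = 3 * 224 * 224 * 4

for workers in (8, 32, 64):
    total = inflight_batch_bytes(workers, 2, 256, sample_bytes)
    print(f"num_workers={workers}: up to {total / 1024**3:.1f} GiB per DataLoader")
```

Under DDP each process typically builds its own DataLoader, so the bound applies once per GPU process. A point-in-time df -h reading can also miss short-lived peaks.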

Hi @FruitVinegar, have you found a solution for this problem? I have the same setup as you and the same problem.

I’m not sure, but in my case one of the DRAM slots was faulty and the HDD had bad sectors. After fixing these two hardware issues, the symptom went away.

Do you have a solution for this problem?
I have the same problem and I also do not use Docker.
I am confused about why it still occurs when I do not use a Docker container.

Same here: newest PyTorch, Ubuntu 20, no Docker involved:

kernel.shmmax = 18446744073692774399

I am trying to train in parallel on 2 GPUs.

Same here, after a few epochs the training crashes because there is not enough shared memory.
I train on 2 GPUs with a dataloader with 32 workers.
No Docker involved, 94GB of /dev/shm, which seems like a lot.

Same problem. Training on HPC, single-node sbatch job (no Docker), 4 GPUs, crash during the second epoch. I don’t understand why it crashes during the second epoch but not the first.

I also encountered the same issue.
There is no problem when using 2 GPUs, but problems occur with 3 or more.
The issue was resolved by adding the --ipc=host option when creating the container.

But I don’t understand how this problem relates to the shared memory of the container.
Could you please explain why we need to increase the shared memory size?

I would guess that nn.DataParallel might need additional shared memory, but I am not sure, as we generally recommend using DistributedDataParallel for better performance.
