Training crashes due to insufficient shared memory (shm) with nn.DataParallel

I have this issue on PyTorch 1.1.0 too. Is there an example of increasing the size of shared memory? Thanks!

I have this issue on PyTorch 1.1.0 too. Is there any way for users to set the size of shared memory? Thanks!

If you are using Ubuntu, you could check the max shared memory size via:

sysctl kernel.shmmax

and set a new value in /etc/sysctl.conf as:

kernel.shmmax=6400000
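For reference, the same limit can also be read directly from procfs without the sysctl binary. A minimal sketch, assuming a Linux system where /proc/sys/kernel/shmmax exists (values are plain text, in bytes):

```python
def read_shm_limit(name):
    # Entries under /proc/sys/kernel mirror the sysctl kernel.* keys.
    with open(f"/proc/sys/kernel/{name}") as f:
        return int(f.read().strip())

print("kernel.shmmax =", read_shm_limit("shmmax"))
```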

Thanks! When I execute the command:
sysctl kernel.shmmax
The result is:
18446744073692774399
Does that mean the value of shmmax in my system is big enough?

It might be big enough.
Which errors are you seeing that make you assume your shared memory is not large enough?

The error message is the same as ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm)
When I set num_workers=0, the code can run normally.


I get the same situation with next(iter(data_loader)) (my /dev/shm is 256G). Setting num_workers=0 does indeed fix this, but num_workers=0 takes more time to load the data. There is an issue for this situation (https://github.com/pytorch/pytorch/issues/13246), but can we have a better solution?


For me the issue was that I was already converting numpy arrays to torch tensors in the dataloader's __getitem__.

Numpy arrays should only be converted to torch tensors in the trainer loop, just before being sent to the model. Otherwise the tensors will make the shared memory grow out of bounds.
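The pattern above can be sketched without the DataLoader machinery. The Dataset-style class below (hypothetical name, numpy only) follows the same __len__/__getitem__ protocol as torch.utils.data.Dataset: __getitem__ returns plain numpy arrays, and the tensor conversion is deferred to the loop:

```python
import numpy as np

class ArrayDataset:
    """Minimal Dataset-style sketch: __getitem__ returns plain numpy
    arrays; tensor conversion happens later, in the training loop."""
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]  # a numpy array, NOT a torch tensor

ds = ArrayDataset(np.arange(12, dtype=np.float32).reshape(4, 3))
for i in range(len(ds)):
    sample = ds[i]
    # torch.from_numpy(sample) would go here, just before the model call,
    # so DataLoader worker processes never hold tensors in shared memory.
    assert isinstance(sample, np.ndarray)
```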

You can monitor the shared memory by running the command watch -n .3 df -h
The shared memory corresponds to the line /dev/shm
The used amount should not increase after each epoch.
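If watch/df are not handy, the same /dev/shm numbers can be read with the Python standard library. A small sketch, assuming a Linux tmpfs mounted at /dev/shm:

```python
import os

def shm_usage(path="/dev/shm"):
    """Return (total, used, free) in bytes for the filesystem at `path`."""
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize
    free = st.f_bavail * st.f_frsize
    return total, total - free, free

total, used, free = shm_usage()
print(f"/dev/shm: {used / 2**20:.1f} MiB used of {total / 2**20:.1f} MiB")
```

Calling this at the end of each epoch is a quick way to confirm whether the used amount keeps growing.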


I was always under the impression that arrays should be converted to tensors in __getitem__. It’s shown in the tutorial: https://pytorch.org/tutorials/beginner/data_loading_tutorial.html

That would make some sense, since some kinds of data cannot be gathered into an array until the collate_fn, e.g. text data. But why would they make the memory grow out of bounds? I thought that CPU tensors are just wrappers around ndarrays.
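As an illustration of that point, here is a minimal sketch of a pad-and-stack collate function for variable-length sequences (pad_collate is a hypothetical name; numpy only, no torch — a real DataLoader would receive it via its collate_fn argument):

```python
import numpy as np

def pad_collate(batch):
    """Pad each integer sequence with 0 to the batch maximum length and
    stack the result into a single (batch, max_len) array."""
    max_len = max(len(seq) for seq in batch)
    out = np.zeros((len(batch), max_len), dtype=np.int64)
    for i, seq in enumerate(batch):
        out[i, :len(seq)] = seq
    return out

print(pad_collate([[1, 2, 3], [4, 5], [6]]))
```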


@ptrblck I am still facing this error that the shared memory is not large enough. I face this issue when I use large models. For example, if I use four resnet50 models as sub-models in a single large model, then I face this issue. However, if I change the four resnet50 models to four resnet18 models in that same large model, then I don’t face this shared memory issue. Is there any way I can increase the shared memory in PyTorch, or do I need to modify the UNIX system? Thanks in advance.

You would need to modify it in your system.
Are you using a manual multiprocessing (or multi-node) approach? I’m not sure why your model needs shared memory. Usually the DataLoader uses the shared memory to transfer the loaded data and I’m unsure why you are seeing this issue when switching models.


I was using the pytorch-lightning framework for DistributedDataParallel (ddp) training.

@ptrblck Here is the detailed error log:
However, if I reduce num_workers from 2 to 0, then it works properly, but it increases the training time significantly.

RuntimeError: DataLoader worker (pid 24459) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 389, in ddp_train
    self.run_pretrain_routine(model)
  File "/home/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1015, in run_pretrain_routine
    self.train()
  File "/home/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 347, in train
    self.run_training_epoch()
  File "/home/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 451, in run_training_epoch
    self.run_evaluation(test_mode=self.testing)
  File "/home/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 377, in run_evaluation
    eval_results = self._evaluate(self.model, dataloaders, max_batches, test_mode)
  File "/home/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 256, in _evaluate
    for batch_idx, batch in enumerate(dataloader):
  File "/home/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/home/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 841, in _next_data
    idx, data = self._get_data()
  File "/home/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 808, in _get_data
    success, data = self._try_get_data()
  File "/home/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 774, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 24459) exited unexpectedly

This error message indeed points to the shared memory limit for multiprocessing, so you would need to increase it.


Hi, I got the same error,

ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
  File "/home/miniconda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 480, in _try_get_batch
    data = self.data_queue.get(timeout=timeout)
  File "/home/miniconda/lib/python3.6/multiprocessing/queues.py", line 104, in get
    if not self._poll(timeout):
  File "/home/miniconda/lib/python3.6/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/home/miniconda/lib/python3.6/multiprocessing/connection.py", line 414, in _poll
    r = wait([self], timeout)
  File "/home/miniconda/lib/python3.6/multiprocessing/connection.py", line 911, in wait
    ready = selector.select(timeout)
  File "/home/miniconda/lib/python3.6/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)
  File "/home/miniconda/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 65, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 95106) is killed by signal: Segmentation fault.

What is interesting is that when I just parse the data line by line, I do not have this issue:

        with open(current_file, mode='rb') as f:
            text = f.read().decode('utf-8')
            all_data.extend(text.split('\n'))

but if I add JSON parsing logic after reading line by line, it reports this error:

        with open(current_file, mode='rb') as f:
            text = f.read().decode('utf-8')
            all_data.extend(text.split('\n'))

        json_data = []
        for line in all_data:
            try:
                json_data.append(json.loads(line))
            except:
                break

        return json_data
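Not an answer to the crash itself, but a hedged variant of the snippet above that surfaces malformed lines instead of silently stopping at the first failure (parse_json_lines is a hypothetical helper), which makes worker problems easier to localize:

```python
import json

def parse_json_lines(lines):
    """Parse each line as JSON; report and count malformed lines
    rather than breaking out of the loop on the first bad one."""
    parsed, bad = [], 0
    for n, line in enumerate(lines, 1):
        if not line.strip():
            continue  # skip blank lines, e.g. a trailing newline
        try:
            parsed.append(json.loads(line))
        except json.JSONDecodeError as exc:
            bad += 1
            print(f"line {n}: skipped ({exc})")
    return parsed, bad

data, bad = parse_json_lines(['{"a": 1}', 'not json', '{"b": 2}'])
```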

Any clue why this happens when JSON parsing is added? I think it is a pretty standard process, not specific to PyTorch.

I searched around and everything points to shared memory; however, I am not fully convinced my shm has a problem. I tested my shm in Docker and it is:

kernel.shmmax = 18446744073692774399

And also in my local test, even with a very small data sample (100 lines of JSON and num_workers=2), it still has the problem, so I wonder what the issue could be.

I met the same issue in PyTorch 1.5.1, and setting a new value in /etc/sysctl.conf did not work (the default value of kernel.shmmax is already big enough: 18446744073692774399).

This time, I used the df -h command and found there is a filesystem mounted at /dev/shm (shm is the shared memory, whose size defaults to 50% of the machine’s memory). Then I remounted it with a larger size:

mount -o remount -o size=yourMemorySize -o nr_inodes=1000000 -o noatime,nodiratime /dev/shm

This problem is fixed.

By the way, the OS I used is CentOS 7.

Hi @ptrblck ,

is it possible to have a more precise description of what is causing the error, to ease debugging? Is it caused by:

  1. my code
  2. the RAM
  3. the GPU memory
  4. filesystem memory/storage

I find that if it’s a recurring issue it’s often useful to state the source of the problem to more effectively fix it.


This solved my problem! Thank you

Hi. I’m suffering the same issue…

Platform info:

  • ubuntu 18.04
  • pytorch 1.8.1
  • cuda version 11.0
  • nvidia driver version 450.102.04
  • No Docker; I use a conda environment and am (luckily) the only user of this server.

System info:

  • Physical memory: 256GB
  • Shared memory: 1/2 of physical memory ~= 126GB
  • GPU: TITAN V (VRAM: 12GB) × 8

Training info:

  • Model: Quantization-aware training applied(library: brevitas) ResNet18
  • Dataset: I tried CIFAR10, CIFAR100, and ImageNet, but no error occurred with CIFAR10/100; maybe it is relevant to large datasets.
  • Distribution: Pytorch DDP

Symptoms:
Suddenly one GPU is released from multiprocessing (its memory usage goes to 0 and its dedicated process disappears) while the other GPUs lock on a semaphore at 100% GPU utilization.

I tried num_workers=8, 32, and 64, but got the error with 32 and 64.
I know that more workers consume more shared memory, but in my case it only consumes <1% of shared memory. (I monitored it with df -h.)

num_workers=64 is 3.5x faster than 8, so I don’t want to give up this option.
Are there any known bugs for num_workers?


Hi @FruitVinegar, have you found a solution for this problem? I have the same setup as you and the same problem.