You would need to modify it in your system.
Are you using a manual multiprocessing (or multi-node) approach? I’m not sure why your model itself would need shared memory. Usually the DataLoader
uses shared memory to transfer the loaded data between processes, so I’m unsure why you are seeing this issue when switching models.
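To illustrate (a minimal sketch, assuming a Linux machine with PyTorch's default file_descriptor sharing strategy): batches built by worker processes typically arrive in the main process backed by shared memory, which is why more workers and larger batches mean more /dev/shm usage.

import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    dataset = TensorDataset(torch.randn(1000, 3, 224, 224))

    # num_workers=0: the batch is built in the main process, no shared memory involved
    batch, = next(iter(DataLoader(dataset, batch_size=32, num_workers=0)))
    print(batch.is_shared())  # usually False

    # num_workers=2: the batch is built in a worker process and handed back
    # through shared memory (/dev/shm on Linux)
    batch, = next(iter(DataLoader(dataset, batch_size=32, num_workers=2)))
    print(batch.is_shared())  # usually True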
I was using the pytorch-lightning framework with DistributedDataParallel (ddp) mode for training.
@ptrblck Here is the detailed error log:
However, if I reduce num_workers from 2 to 0, it works properly, but that increases the training time significantly.
RuntimeError: DataLoader worker (pid 24459) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/home/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 389, in ddp_train
self.run_pretrain_routine(model)
File "/home/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1015, in run_pretrain_routine
self.train()
File "/home/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 347, in train
self.run_training_epoch()
File "/home/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 451, in run_training_epoch
self.run_evaluation(test_mode=self.testing)
File "/home/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 377, in run_evaluation
eval_results = self._evaluate(self.model, dataloaders, max_batches, test_mode)
File "/home/.local/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 256, in _evaluate
for batch_idx, batch in enumerate(dataloader):
File "/home/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
data = self._next_data()
File "/home/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 841, in _next_data
idx, data = self._get_data()
File "/home/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 808, in _get_data
success, data = self._try_get_data()
File "/home/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 774, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
RuntimeError: DataLoader worker (pid(s) 24459) exited unexpectedly
This error message does indeed point to the shared memory limit for multiprocessing, so you would need to increase it.
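For reference, a minimal sketch of how to check the shared-memory mount that the workers actually use; on most Linux systems the size of the /dev/shm tmpfs is the relevant limit, not the SysV kernel.shmmax value:

import shutil

# /dev/shm is the tmpfs mount that backs shared memory on most Linux systems;
# its size (often 50% of RAM by default) is what the DataLoader workers run
# into, not kernel.shmmax.
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm total: {total / 2**30:.1f} GiB, "
      f"used: {used / 2**30:.1f} GiB, free: {free / 2**30:.1f} GiB")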
Hi, I got the same error:
ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
File "/home/miniconda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 480, in _try_get_batch
data = self.data_queue.get(timeout=timeout)
File "/home/miniconda/lib/python3.6/multiprocessing/queues.py", line 104, in get
if not self._poll(timeout):
File "/home/miniconda/lib/python3.6/multiprocessing/connection.py", line 257, in poll
return self._poll(timeout)
File "/home/miniconda/lib/python3.6/multiprocessing/connection.py", line 414, in _poll
r = wait([self], timeout)
File "/home/miniconda/lib/python3.6/multiprocessing/connection.py", line 911, in wait
ready = selector.select(timeout)
File "/home/miniconda/lib/python3.6/selectors.py", line 376, in select
fd_event_list = self._poll.poll(timeout)
File "/home/miniconda/lib/python3.6/site-packages/torch/utils/data/_utils/signal_handling.py", line 65, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 95106) is killed by signal: Segmentation fault.
What is interesting is that when I just read the data line by line, I do not have this issue:
with open(current_file, mode='rb') as f:
    text = f.read().decode('utf-8')
all_data.extend(text.split('\n'))
but if I add JSON parsing logic after reading line by line, it reports this error:
import json

with open(current_file, mode='rb') as f:
    text = f.read().decode('utf-8')
all_data.extend(text.split('\n'))

json_data = []
for line in all_data:
    try:
        json_data.append(json.loads(line))
    except:
        # stop at the first line that fails to parse (e.g. a trailing empty line)
        break
return json_data
Any clue why this happens once the JSON parsing is added? I think it is a pretty standard process, not specific to PyTorch.
I searched around and everything points to shared memory; however, I am not fully convinced my shm has a problem. I tested my shm in Docker and it shows:
kernel.shmmax = 18446744073692774399
Also, in my local test, even with a very small data sample (100 lines of JSON) and num_workers=2, it still has the problem, so I wonder what the issue could be.
I met the same issue in PyTorch 1.5.1, and setting a new value in /etc/sysctl.conf did not work (the default value of kernel.shmmax = 18446744073692774399 is already large enough).
This time, I used the df -h command and found there is a filesystem mounted at /dev/shm (shm stands for shared memory), whose size is 50% of the machine’s memory. Then I remounted it with:
mount -o size=yourMemorySize -o nr_inodes=1000000 -o noatime,nodiratime -o remount /dev/shm
This problem is fixed.
By the way, the OS I used is CentOS 7.
Hi @ptrblck,
is it possible to have a more precise description of what is causing the error, to ease debugging? Is it caused by:
- my code
- the RAM
- the GPU memory
- filesystem memory/storage
I find that if it’s a recurring issue, it’s often useful to state the source of the problem so it can be fixed more effectively.
Hi. I’m suffering from the same issue…
Platform info:
- ubuntu 18.04
- pytorch 1.8.1
- cuda version 11.0
- nvidia driver version 450.102.04
- No Docker; I use a conda environment and am (luckily) the only user of this server.
System info:
- Physical memory: 256GB
- Shared memory: 1/2 of physical memory ~= 126GB
- GPU: TITAN V (12GB VRAM) × 8
Training info:
- Model: ResNet18 with quantization-aware training applied (library: brevitas)
- Dataset: I tried CIFAR10, CIFAR100, and ImageNet, but no error occurred with CIFAR10/100, so it may be related to large datasets.
- Distribution: Pytorch DDP
Symptoms:
Suddenly one GPU is released from multiprocessing (its memory usage drops to 0 and its dedicated process disappears) while the other GPUs stay locked on a semaphore at 100% GPU utilization.
I tried num_workers=8, 32, and 64, and got the error with 32 and 64.
I know that more workers consume more shared memory, but in my case they only consume <1% of shared memory (I monitored it with df -h).
num_workers=64 is 3.5x faster than 8, so I don’t want to give up this option.
Are there any known bugs with num_workers?
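For reference, a rough sketch of how the pure data-loading time can be compared across num_workers values (train_dataset here is a placeholder for the actual dataset):

import time
from torch.utils.data import DataLoader

# train_dataset is a placeholder for the real Dataset used in training
for num_workers in (8, 32, 64):
    loader = DataLoader(train_dataset, batch_size=256, num_workers=num_workers)
    start = time.time()
    for _ in loader:
        pass  # iterate only, to isolate data-loading time from the model
    print(f"num_workers={num_workers}: {time.time() - start:.1f}s per epoch")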
Hi @FruitVinegar, have you found a solution for this problem? I have the same setup as you and the same problem.
I’m not sure, but in my case there was a problem with one of the DRAM slots and a bad sector on the HDD; after solving those two hardware issues the symptom was gone…
Do you have a solution to this problem?
I have the same problem and I also do not use Docker.
I am confused about why it still occurs when I am not using a Docker container.
Same here, newest PyTorch, Ubuntu 20.04, no Docker involved:
kernel.shmmax = 18446744073692774399
I am trying to train in parallel on 2 GPUs.
Same here; after a few epochs the training crashes because there is not enough shared memory.
I train on 2 GPUs with a dataloader with 32 workers.
No Docker involved, 94GB of /dev/shm, which seems like a lot.
Same problem. Training on an HPC cluster, single-node sbatch job (no Docker), 4 GPUs, crash during the second epoch. I don’t understand why it doesn’t crash during the first epoch but does during the second.
I also encountered the same issue.
There is no problem when using 2 GPUs, but problems occur with 3 or more.
The issue was resolved by adding the --ipc=host option when creating the container.
But I don’t understand how this problem relates to the shared memory of the container.
Could you please explain why we need to increase the shared memory size?
I would guess that nn.DataParallel might need additional shared memory, but I’m not sure, as we generally recommend using DistributedDataParallel for better performance.
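For completeness, a minimal DistributedDataParallel sketch (assuming a recent PyTorch and a single-node launch via torchrun --nproc_per_node=NUM_GPUS train.py; model and train_dataset are placeholders for your own objects):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

dist.init_process_group(backend="nccl")      # torchrun sets the rank/world-size env vars
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# model and train_dataset are placeholders for your own model and dataset
ddp_model = DDP(model.cuda(local_rank), device_ids=[local_rank])
sampler = DistributedSampler(train_dataset)  # each process gets its own shard of the data
loader = DataLoader(train_dataset, batch_size=64, sampler=sampler, num_workers=4)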
Hello, have you solved this problem?
I ran the same code on other servers without this error.
I have the same error with a conda environment. kernel.shmmax=18446744073692774399, and if I watch df -h I can see /dev/shm saturate and the code break. The pseudo-code that generates the error is:
dl = DataLoader(X, batch_size=batch_size, num_workers=16)
tmp = []
for i, batch in enumerate(dl):
    # keeping a reference to every batch keeps its shared-memory backing alive
    tmp.append(batch.numpy())
[...]
If I change tmp to a preallocated empty numpy array and iteratively populate it with the current batch, the shared memory does not saturate. The working pseudo-code is:
dl = DataLoader(X, batch_size=batch_size, num_workers=16)
tmp = np.empty([desired_num_rows, X.shape[1]], dtype=np.float32)
for i, batch in enumerate(dl):
    # copy the batch into the preallocated array instead of keeping the batch itself
    t_from = i * batch_size
    t_to = t_from + batch.shape[0]
    tmp[t_from:t_to, :] = batch.numpy()
[...]
Alternatively, this suggestion also works.