Shm error in docker

When I run PyTorch in Docker, I get the error below. I checked the source code, and the shm path is hard-coded as "/torch_XXX". Is there any way to change the shm path without recompiling from source? Or can I simply disable shm from Python?

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Process Process-3:
Traceback (most recent call last):
  File "pytorch/main.py", line 205, in <module>
    main()
  File "pytorch/main.py", line 116, in main
    train(train_loader, model, criterion, optimizer, epoch)
  File "pytorch/main.py", line 136, in train
    for i, (input, target) in enumerate(train_loader):
  File "/usr/local/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 286, in __next__
    return self._process_next_batch(batch)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 307, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
  File "/usr/local/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 57, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/usr/local/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in default_collate
    return [default_collate(samples) for samples in transposed]
  File "/usr/local/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 138, in <listcomp>
    return [default_collate(samples) for samples in transposed]
  File "/usr/local/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 113, in default_collate
    storage = batch[0].storage()._new_shared(numel)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/torch/storage.py", line 114, in _new_shared
    return cls._new_using_filename(size)
RuntimeError: unable to write to file </torch_76_3625483894> at /pytorch/aten/src/TH/THAllocator.c:383

Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/local/anaconda3/lib/python3.6/multiprocessing/popen_fork.py", line 35, in poll
    pid, sts = os.waitpid(self.pid, flag)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 178, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 87) is killed by signal: Bus error.

Could you try to start your docker container with --ipc=host?
From the GitHub docs:

Please note that PyTorch uses shared memory to share data between processes, so if torch multiprocessing is used (e.g. for multithreaded data loaders) the default shared memory segment size that container runs with is not enough, and you should increase shared memory size either with --ipc=host or --shm-size command line options to nvidia-docker run.
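
To see how much shared memory the container actually has, something like the following sketch may help. It assumes Linux, where shared memory is backed by the tmpfs mounted at /dev/shm (Docker's default size there is 64 MB). `shm_usage` is a hypothetical helper name, not part of PyTorch.

```python
import shutil


def shm_usage(path="/dev/shm"):
    """Return (total, used, free) in MiB for the filesystem backing `path`.

    Assumption: on Linux, shared memory segments live on the tmpfs
    mounted at /dev/shm, so its capacity is the shm limit.
    """
    usage = shutil.disk_usage(path)
    mib = 1024 * 1024
    return usage.total / mib, usage.used / mib, usage.free / mib


# Example (run inside the container):
#   total, used, free = shm_usage()
#   print(f"/dev/shm: total={total:.0f} MiB, free={free:.0f} MiB")
```

Calling `shm_usage()` inside the container shows whether a `--shm-size` setting took effect. If free space is tight, creating the DataLoader with `num_workers=0` loads data in the main process, so batches are not passed through shared memory, at the cost of slower loading.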

For security reasons, we are not allowed to use "--ipc=host" at our company.
So can I change the shm path from /torch_XXX to /cache in PyTorch without modifying the source code?
The path is hard-coded in the source as follows:

#ifndef THC_GENERIC_FILE
// TODO: move this somewhere - we only need one version
static std::string THPStorage_(_newHandle)() {
  static std::random_device rd;
  std::string handle = "/torch_";
#ifdef _MSC_VER
  handle += std::to_string(GetCurrentProcessId());
#else
  handle += std::to_string(getpid());
#endif
  handle += "_";
  handle += std::to_string(rd());
  return handle;
}
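
For what it's worth, the handle built here looks like a POSIX shared memory name rather than an ordinary file path, which would explain why the error mentions `</torch_76_3625483894>`. A rough Python sketch of the same naming scheme is below. `make_handle` and `linux_backing_path` are hypothetical names for illustration only, and the /dev/shm mapping is an assumption that holds for Linux with glibc.

```python
import os
import random


def make_handle(pid=None, rand=None):
    """Mimic the naming scheme: "/torch_" + process id + "_" + random number."""
    if pid is None:
        pid = os.getpid()
    if rand is None:
        # Stand-in for std::random_device; 32 random bits from the OS.
        rand = random.SystemRandom().getrandbits(32)
    return f"/torch_{pid}_{rand}"


def linux_backing_path(handle):
    """Where a POSIX shm name shows up on Linux (assumption: glibc, /dev/shm)."""
    return "/dev/shm/" + handle.lstrip("/")


# The name from the error message corresponds to pid 76:
print(make_handle(76, 3625483894))
print(linux_backing_path(make_handle(76, 3625483894)))
```

If that reading is right, the leading "/" is part of the shm name and not a directory, so the data ends up under /dev/shm regardless of the prefix.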

What about setting --shm-size? Would that be allowed in your company?
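
To get a feel for how large --shm-size would need to be, here is a back-of-the-envelope sketch. It assumes float32 inputs, counts only the input tensor (not targets), and assumes each worker keeps about two collated batches in flight. All of these are assumptions, and `estimate_shm_bytes` is a hypothetical helper, not a PyTorch function.

```python
from functools import reduce
from operator import mul


def estimate_shm_bytes(batch_size, sample_shape, num_workers,
                       bytes_per_element=4, batches_in_flight_per_worker=2):
    """Rough lower bound on shared memory used by DataLoader workers.

    Assumptions (not taken from the PyTorch source): float32 data,
    only the input tensor is counted, and each worker has roughly
    `batches_in_flight_per_worker` collated batches waiting at once.
    """
    elements_per_sample = reduce(mul, sample_shape, 1)
    batch_bytes = batch_size * elements_per_sample * bytes_per_element
    return batch_bytes * num_workers * batches_in_flight_per_worker


# Example: 256 images of shape 3x224x224 with 4 workers.
needed = estimate_shm_bytes(256, (3, 224, 224), num_workers=4)
print(f"{needed / 1024**2:.0f} MiB")
```

The real number depends on the dataset and PyTorch version, so treat the result as a starting point and leave some headroom.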

Yeah, but the size cannot be too large :joy:
It would be better if I could change the path.

Hey, did you find a solution to this?