RuntimeError: unable to open shared memory object

Hey guys, I'm having trouble loading document indexes.
I am testing my code, so I set

batch_size = 4
number_of_sentences_in_document = 84
number_of_words_in_sentence = 80

so one mini-batch holds 80 * 84 * 4 = 26,880 word indexes.
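In tensor terms, one mini-batch is just an index tensor of that shape, something like this (a sketch; the variable name is made up):

import torch

# batch_size=4, 84 sentences per document, 80 words per sentence
mini_batch = torch.zeros(4, 84, 80, dtype=torch.long)
print(mini_batch.numel())  # 26880 word indexes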

The problem is that when I wrap that index dataset in a DataLoader, as below,
and try to loop over trainloader, it prints a long stream of error messages.

DataManager = DS.NewsDataset(data_examples_gen, Vocab)  # DS, data_examples_gen and Vocab come from my own modules
trainloader = torch.utils.data.DataLoader(DataManager, batch_size=Args.args.batch_size,
                                          shuffle=True, num_workers=32)

The error messages are below.

Traceback (most recent call last):
File "/home/nlpgpu3/anaconda3/envs/linohong3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/home/nlpgpu3/anaconda3/envs/linohong3/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/nlpgpu3/anaconda3/envs/linohong3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 61, in _worker_loop
data_queue.put((idx, samples))
File "/home/nlpgpu3/anaconda3/envs/linohong3/lib/python3.6/multiprocessing/queues.py", line 341, in put
File "/home/nlpgpu3/anaconda3/envs/linohong3/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
File "/home/nlpgpu3/anaconda3/envs/linohong3/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 121, in reduce_storage
RuntimeError: unable to open shared memory object </torch_54163_3383444026> in read-write mode at /opt/conda/conda-bld/pytorch_1525909934016/work/aten/src/TH/THAllocator.c:342

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/nlpgpu3/anaconda3/envs/linohong3/lib/python3.6/multiprocessing/util.py", line 262, in _run_finalizers
File "/home/nlpgpu3/anaconda3/envs/linohong3/lib/python3.6/multiprocessing/util.py", line 186, in __call__
File "/home/nlpgpu3/anaconda3/envs/linohong3/lib/python3.6/shutil.py", line 476, in rmtree
File "/home/nlpgpu3/anaconda3/envs/linohong3/lib/python3.6/shutil.py", line 474, in rmtree
OSError: [Errno 24] Too many open files: '/tmp/pymp-sgew4xdn'
Process Process-1:
Traceback (most recent call last):
File "/home/nlpgpu3/anaconda3/envs/linohong3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/home/nlpgpu3/anaconda3/envs/linohong3/lib/python3.6/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/home/nlpgpu3/anaconda3/envs/linohong3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 61, in _worker_loop
data_queue.put((idx, samples))
File "/home/nlpgpu3/anaconda3/envs/linohong3/lib/python3.6/multiprocessing/queues.py", line 341, in put
File "/home/nlpgpu3/anaconda3/envs/linohong3/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
File "/home/nlpgpu3/anaconda3/envs/linohong3/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 121, in reduce_storage
RuntimeError: unable to open shared memory object </torch_54163_3383444026> in read-write mode at /opt/conda/conda-bld/pytorch_1525909934016/work/aten/src/TH/THAllocator.c:342
Traceback (most recent call last):
File "/home/nlpgpu3/LinoHong/FakeNewsByTitle/main.py", line 25, in <module>
for mini_batch in trainloader :
File "/home/nlpgpu3/anaconda3/envs/linohong3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 280, in __next__
idx, batch = self._get_batch()
File "/home/nlpgpu3/anaconda3/envs/linohong3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 259, in _get_batch
return self.data_queue.get()
File "/home/nlpgpu3/anaconda3/envs/linohong3/lib/python3.6/multiprocessing/queues.py", line 335, in get
res = self._reader.recv_bytes()
File "/home/nlpgpu3/anaconda3/envs/linohong3/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/home/nlpgpu3/anaconda3/envs/linohong3/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/home/nlpgpu3/anaconda3/envs/linohong3/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
chunk = read(handle, remaining)
File "/home/nlpgpu3/anaconda3/envs/linohong3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 178, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 54163) exited unexpectedly with exit code 1.

Process finished with exit code 1

There are three kinds of errors:

  1. RuntimeError: unable to open shared memory object </torch_54163_3383444026> in read-write mode at /opt/conda/conda-bld/pytorch_1525909934016/work/aten/src/TH/THAllocator.c:342
  2. OSError: [Errno 24] Too many open files: '/tmp/pymp-sgew4xdn'
  3. RuntimeError: DataLoader worker (pid 54163) exited unexpectedly with exit code 1.

I thought this was some kind of memory problem, so I tried the same thing
with only two sentences per document, and it worked.
However, I expect this to get much larger, with
batch_size up to 32 or 64,
the number of sentences per document up to 84, and
the number of words per sentence up to 84.
How can I handle this problem?
Any ideas?


Could you check the suggestions in this issue?
It looks like your shared memory is too low.

Does your code run with a lower num_workers, or with num_workers=0?
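For a quick check, something like this keeps all data loading in the main process (a sketch reusing the DataManager from your snippet; the batch size is just an example):

import torch

trainloader = torch.utils.data.DataLoader(DataManager, batch_size=4, shuffle=True, num_workers=0)
for mini_batch in trainloader:
    pass  # if this loop completes, the data pipeline itself is fine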


Wow, thanks!!
It worked when I set num_workers to 0;
it fails whenever num_workers is greater than or equal to 1.

But doesn't setting it to 0 mean that
I'm not using any GPUs?

No, num_workers=0 means your data is loaded in the main process. You can still use the GPU although your code might be slower. Could you try to increase the shared memory and try setting num_workers>0 again?
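Roughly, the pattern looks like this (a minimal sketch reusing your DataManager; the .to(device) call is the usual way to move batches to the GPU and is independent of how they were loaded):

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
trainloader = torch.utils.data.DataLoader(DataManager, batch_size=4, shuffle=True, num_workers=0)
for mini_batch in trainloader:
    mini_batch = mini_batch.to(device)  # batch is built in the main process, then moved to the GPU
    # ... run the forward/backward pass on the GPU as usual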


I've met the same problem: once I set num_workers to more than one, the error occurs randomly. Could you give me any help?

The error usually means that your system doesn't provide enough shared memory for multiple workers (used via num_workers>0). Check your system's shared memory limit and try to increase it.
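On Linux you can check the size of the shared memory mount from Python (assuming it is mounted at /dev/shm, the usual location):

import os

# Report the total size of the shared-memory filesystem.
st = os.statvfs('/dev/shm')
print('shared memory: %.1f GB' % (st.f_frsize * st.f_blocks / 1024 ** 3))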

Thank you!
But I don't think that's the reason here. It actually depends on the PyTorch version: with 1.8.0 and the nightly build it works well, but with 1.10.2 (the latest stable version) it always has this problem. Someone has created an issue on GitHub, and they fixed it in the nightly PyTorch.
I also tested switching the sharing strategy to 'file_system' and enlarging the shared memory; neither helped.
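For reference, the strategy switch is the standard torch.multiprocessing call (a minimal sketch):

import torch.multiprocessing

# Share tensors through files on disk instead of POSIX shared-memory objects.
torch.multiprocessing.set_sharing_strategy('file_system')
print(torch.multiprocessing.get_sharing_strategy())  # now 'file_system'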


Ah, interesting. Could you post the GitHub issue here for visibility?

Here is the fix in the nightly PyTorch, and I have tested it.
That page also links to a description of this problem.
