Hello,
I am facing this error while trying to run my code:
Traceback (most recent call last):
File "/wrk1/Salwa_Directory/KhairyCode/find-goal/train.py", line 72, in
File "/wrk1/Salwa_Directory/KhairyCode/find-goal/util/shared_opt.py", line 42, in init
File "/wrk1/Salwa_Directory/KhairyCode/find-goal/util/shared_opt.py", line 49, in share_memory
File "/homedir05/smostafa22/.local/lib/python3.9/site-packages/torch/tensor.py", line 515, in share_memory
File "/homedir05/smostafa22/.local/lib/python3.9/site-packages/torch/storage.py", line 599, in share_memory_
File "/homedir05/smostafa22/.local/lib/python3.9/site-packages/torch/storage.py", line 195, in share_memory_
RuntimeError: unable to open shared memory object </torch_3282906_2794037818_1009> in read-write mode: Too many open files (24)
First, I checked the kernel's shared-memory segment limit:
$ cat /proc/sys/kernel/shmmni
4096
I am working on a university server, so I don't have permission to raise this limit. Raising the open-files limit is not permitted either:
$ ulimit -n 16384
bash: ulimit: open files: cannot modify limit: Operation not permitted
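One workaround I am considering, since bash refuses to change the limit: raising only the soft open-files limit from inside Python at the top of train.py. This is just a sketch; it only helps if the soft limit is actually below the hard limit, since raising the hard limit itself needs root.

```python
import resource

# Current per-process open-file limits: (soft, hard).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft}, hard={hard}")

# A process may raise its own soft limit up to the hard limit
# without root; only raising the hard limit needs privileges.
if soft < hard:
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```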
Second, I tried to change the sharing strategy:
import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')
but it raises the following error instead
File "/homedir05/smostafa22/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 689, in
return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
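As a sanity check, the strategy switch itself can be verified in isolation (assuming only that torch is importable and running on Linux, where the file_system strategy is available):

```python
import torch.multiprocessing as mp

# file_system must be set before any tensors are shared or any
# worker processes are started; it backs shared storages with
# files on disk instead of file descriptors, which sidesteps
# the per-process open-files limit.
mp.set_sharing_strategy('file_system')
print(mp.get_sharing_strategy())  # -> file_system
```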
Can you please help me out?