RuntimeError: unable to open shared memory object </torch_3282906_2794037818_1009> in read-write mode: Too many open files (24)

Hello

I am facing this error while trying to run my code:

Traceback (most recent call last):
  File "/wrk1/Salwa_Directory/KhairyCode/find-goal/train.py", line 72, in <module>
  File "/wrk1/Salwa_Directory/KhairyCode/find-goal/util/shared_opt.py", line 42, in __init__
  File "/wrk1/Salwa_Directory/KhairyCode/find-goal/util/shared_opt.py", line 49, in share_memory
  File "/homedir05/smostafa22/.local/lib/python3.9/site-packages/torch/tensor.py", line 515, in share_memory_
  File "/homedir05/smostafa22/.local/lib/python3.9/site-packages/torch/storage.py", line 599, in share_memory_
  File "/homedir05/smostafa22/.local/lib/python3.9/site-packages/torch/storage.py", line 195, in share_memory_
RuntimeError: unable to open shared memory object </torch_3282906_2794037818_1009> in read-write mode: Too many open files (24)
First:
I checked the shared-memory segment limit:

$ cat /proc/sys/kernel/shmmni
4096

I am working on a university server, so I don't have permission to increase it.

$ ulimit -n 16384
bash: ulimit: open files: cannot modify limit: Operation not permitted
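Note that `ulimit -n 16384` tries to raise both the soft and the hard open-files limit, and raising the hard limit needs root. If the hard limit is already higher than the soft limit, the training process itself can still raise its own soft limit at runtime; a minimal sketch using Python's standard `resource` module (Linux/macOS only, and the `16384` target is just an example value):

```python
import resource

# Current limits on open file descriptors (what "Too many open files" hits).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft}, hard={hard}")

# An unprivileged process may raise its soft limit up to the hard limit;
# only raising the hard limit itself requires root.
hard_cap = hard if hard != resource.RLIM_INFINITY else 16384
target = min(16384, hard_cap)
if soft < target:
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))

print("soft limit now:", resource.getrlimit(resource.RLIMIT_NOFILE)[0])
```

If the hard limit is itself low (e.g. soft and hard are both 1024), this won't help and only the administrator can raise it.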
Second:
I tried to change the sharing strategy:
import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')

but then it raises the following error instead:

  File "/homedir05/smostafa22/.local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 689, in
    return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1

Can you please help me out?

If you cannot change the system settings, your second approach sounds valid, and the CUDA OOM error might be unrelated. Did you check whether this workload actually fits on your GPU?
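As a rough sanity check on the OOM side, you can do a back-of-envelope estimate of the model's parameter memory before touching the GPU. This is only a sketch: it assumes float32 values and an Adam-style optimizer (two moment buffers per parameter), and it ignores activations, which are workload-dependent and often dominate:

```python
def estimate_param_memory_mb(num_params: int, bytes_per_value: int = 4,
                             optimizer_copies: int = 2) -> float:
    """Rough lower bound on training memory: params + grads + optimizer state.

    optimizer_copies=2 approximates Adam's two moment buffers;
    bytes_per_value=4 assumes float32. Activations are NOT counted.
    """
    values = num_params * (1 + 1 + optimizer_copies)  # params, grads, optimizer
    return values * bytes_per_value / 1024**2

# Example: a 10M-parameter model trained with Adam in float32
print(f"{estimate_param_memory_mb(10_000_000):.0f} MB")  # ~153 MB, before activations
```

If even this lower bound approaches your card's free memory, the OOM is real regardless of the sharing strategy, and you would need a smaller model or batch size.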