RuntimeError: unable to open shared memory object (depending on the model)


I’m facing an issue I’ve seen covered in other topics, but I haven’t found any working solution.
I’m trying to use my library to train different models in a distributed way on a server containing 4 GPUs.
For most models, it works as expected (DeepLabV3+, ResNet101, UNet, Transformers…). There is one model, though, that does not work: HRNet for semantic segmentation,
following the original implementation.
I made minor modifications to this model to remove anything that wasn’t a PyTorch/pure Python object. I also removed all class attributes (not sure if this helps). My re-implementation is here.

On a single GPU, the model works fine, but with multi-GPU DDP, I get the following message:

File "/usagers/clpla/.conda/envs/torch18/lib/python3.8/site-packages/torch/multiprocessing/", line 321, in reduce_storage
RuntimeError: unable to open shared memory object </torch_403534_2756454997> in read-write mode

This error is raised when calling the torch.multiprocessing.spawn function in my code, so before the dataloaders are even created. I’ve still tried varying num_workers (even setting it to 0), with no effect.
Following other recommendations, I’ve tried increasing the shared memory size:

# cat /proc/sys/kernel/shmmni

which did not help either. Finally, I tried changing the sharing strategy to file_system, as described in the PyTorch multiprocessing documentation.
In this case, the previous error does not appear, but it leads to memory leaks (in regular RAM) after a few epochs, and I suspect they come from the dataloader. It also does not explain why the file_descriptor strategy works for certain models and not others.
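For anyone debugging something similar, here is a quick sketch of how the relevant limits can be inspected on a typical Linux box (exact paths and defaults vary by distribution):

```shell
df -h /dev/shm                # size and usage of the shared-memory tmpfs
cat /proc/sys/kernel/shmmni   # maximum number of System V shared memory segments
ulimit -n                     # per-process open file descriptor limit (soft)
ulimit -Hn                    # same limit, hard ceiling
```

The file_descriptor sharing strategy consumes one file descriptor per shared storage, so the `ulimit -n` value is the one most directly tied to this class of error.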

I’m running out of ideas and am open to any suggestions. Let me know if more details are needed; I could try to provide a minimal (non-)working example.


  • OS: CentOS 7
  • Python: 3.8.8
  • PyTorch: 1.8.0
  • GPUs: 4× RTX 2080
  • CUDA: 11.0
  • Driver: 450.66
  • RAM: 125 GB


To summarize, you have tried 3 approaches (as also suggested in this thread):

  1. Set num_workers=0 (i.e., self.config['Manager']['num_workers']=0) when calling the DataLoader constructor;
  2. Increase the shared memory size;
  3. Change the sharing strategy:
import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')

Can you try a few more suggestions in possible deadlock in dataloader · Issue #1355 · pytorch/pytorch · GitHub?

It seems that someone also tried ulimit -n or setNumThreads(0). If all the suggestions are exhausted, feel free to open a bug on GitHub.


Thanks for your reply, you perfectly summarized my approaches.
And thanks to your suggestion, my model finally works!
As you suggested, I had to call:

ulimit -n 64000

The value 64000 is somewhat arbitrary; it corresponds to the open-files limit (the maximum number of open file descriptors per process). I’m not really sure how this is related to distributed training or what a good value should be. The default value of 1024 was enough for some models but not for others.

Anyway, some relevant info if someone ends up with the same issue:

Raising the hard limit requires admin rights (the soft limit can be raised up to the hard limit without them). Note that calling ulimit only affects the current shell, and values are reset at reboot.
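If editing the shell limit is inconvenient (e.g., when jobs are launched through a scheduler), the same adjustment can be made from inside the Python process with the standard-library resource module. This is a sketch, not part of the original fix; it raises the soft open-file limit to the hard limit, which works without root:

```python
import resource

# Query the current open-file limits for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"before: soft={soft}, hard={hard}")

# The soft limit can be raised up to the hard limit without admin rights;
# this is the in-process equivalent of `ulimit -n` (Linux/macOS only).
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"after:  soft={soft}, hard={hard}")
```

Calling this early in the training script (before spawning worker processes) means the higher limit is inherited by all children.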
Thanks again.

Great, it also helps me. Thank you!