RuntimeError: unable to open shared memory object (depending on the model)

ClementPla · March 26, 2021, 2:53pm

Hi,

I’m facing an issue I’ve seen covered in others topics but I haven’t found any working solutions.
I’m trying to use my library to train different models on a distributed way on a server containing 4 GPUs.
For most models, it works as expected (DeepLabV3+, ResNet101, UNeT, Transformers…). There is one model though that does not work which is the HRN for Semantic Segmentation,
following the original implementation
I did minor modification to this model to remove anything that wasn’t Pytorch/pure python objects. I also removed any class attributes (not sure if this helps). My re-implementation is here.

On a single GPU, the model works fine, but on multi-gpu DDP, I get the following message:

File "/usagers/clpla/.conda/envs/torch18/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 321, in reduce_storage
RuntimeError: unable to open shared memory object </torch_403534_2756454997> in read-write mode

This error raises when calling the torch.multiprocessing.spawn function in my code, so before the dataloaders are even created. I’ve still tried varying the num_workers (even set to 0), with no effects.
Following other recommendations, I’ve tried increasing the shared memory size:

# cat /proc/sys/kernel/shmmni
8192

Which did not helped neither. Finally, I tried changing the sharing strategy to file_system, as described in the pytorch multiprocessing documentation
In this case, the previous error does not appeared, but it leads to memory leaks (in regular RAM) after a few epochs and I suspect it comes from the dataloader. It also does not explain why the file_descriptor strategy would work on certain models and not others.

I’m running out of idea, I’m open to any suggestions. Let me know if more details are needed. I could try to provide a minimal (non)-working example.

Specifications:

OS : CentOS 7
Python 3.8.8
Pytorch: 1.8.0
4 GPUs RTX 2080
CUDA: 11.0
Driver: 450.66
RAM: 125G

Thanks!

wayi · March 27, 2021, 6:21am

To summarize, you have tried 3 approaches (as also suggested in this thread):

Set num_workers=0 (i.e., self.config['Manager']['num_workers']=0) when calling DataLoader constructor;
Increase shared memory size;
Change the sharing strategy:

import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')

Can you try a few more suggestions in possible deadlock in dataloader · Issue #1355 · pytorch/pytorch · GitHub?

It seems that someone also tried ulimit -n or setNumThreads(0). If all the suggestions are exhausted, feel free to open a bug on Github.

ClementPla · March 31, 2021, 8:17pm

Hi!
Thanks for your reply, you perfectly summarized my approaches.
And thanks to your suggestion, my model finally works!
As you suggested, I had to call:

ulimit -n 64000

The 64000 is kinda arbitrary, it corresponds to the open files limit. I’m not really sure how this is is related to distributed training and what a good value should be. The default value of 1024 was enough for some models but not for other.

Anyway, some relevant infos if someones ends up with the same issue:

You need admin rights to call ulimit. Note that calling ulimit will only affect the current shell. Values will be resetted at reboot.
Thanks again.

tianhan4 · January 8, 2022, 5:35am

Great, it also helps me. Thank you!