I’m facing an issue I’ve seen covered in other topics, but I haven’t found a working solution.
I’m trying to use my library to train different models in a distributed way on a server with 4 GPUs.
For most models, it works as expected (DeepLabV3+, ResNet101, UNet, Transformers…). There is one model, though, that does not work: HRNet for semantic segmentation, following the original implementation.
I made minor modifications to this model to remove anything that wasn’t a PyTorch or pure Python object. I also removed all class attributes (not sure if this helps). My re-implementation is here.
On a single GPU, the model works fine, but with multi-GPU DDP I get the following message:
File "/usagers/clpla/.conda/envs/torch18/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 321, in reduce_storage RuntimeError: unable to open shared memory object </torch_403534_2756454997> in read-write mode
This error is raised when calling the torch.multiprocessing.spawn function in my code, so before the dataloaders are even created. I’ve still tried varying num_workers (even setting it to 0), with no effect.
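For context, here is roughly how the processes are spawned (a simplified sketch of my setup, with an nn.Sequential standing in for my HRNet re-implementation). If I understand the mechanism correctly, reduce_storage is involved because spawn pickles its args, and the model’s CPU tensors are moved into shared memory on their way to the child processes:

```python
import torch
import torch.nn as nn
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank, world_size, model):
    # One process per GPU, NCCL backend.
    dist.init_process_group(
        "nccl", init_method="tcp://127.0.0.1:23456",
        rank=rank, world_size=world_size,
    )
    torch.cuda.set_device(rank)
    model = DDP(model.cuda(rank), device_ids=[rank])
    # Dataloaders are only built here, inside each worker, which is
    # why changing num_workers has no effect on the crash.
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 4
    model = nn.Sequential(nn.Conv2d(3, 8, 3))  # stand-in for my HRNet re-implementation
    # The traceback above fires inside this call: spawn pickles `args`,
    # and the model's CPU tensors go through reduce_storage (shared
    # memory) to reach the child processes.
    mp.spawn(train, args=(world_size, model), nprocs=world_size)
```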
Following other recommendations, I’ve tried increasing the shared memory limits:
```
# cat /proc/sys/kernel/shmmni
8192
```
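In case it matters: my understanding is that the /torch_* objects from the traceback are POSIX shared memory files under /dev/shm (a Linux assumption), while shmmni only caps the number of SysV segments, so I also checked /dev/shm capacity with a quick Python snippet:

```python
import shutil

# /dev/shm backs the POSIX shared memory objects (/torch_*) that
# reduce_storage creates; its free space may matter more than shmmni.
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")
```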
This did not help either. Finally, I tried changing the sharing strategy to file_system, as described in the PyTorch multiprocessing documentation.
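Concretely, that change is just one call at the top of the entry script, before anything is spawned:

```python
import torch.multiprocessing as mp

# Switch from the default "file_descriptor" strategy to "file_system",
# before any tensors are shared / before spawn is called.
mp.set_sharing_strategy("file_system")
print(mp.get_sharing_strategy())  # -> "file_system"
```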
With this strategy, the previous error no longer appears, but it leads to memory leaks (in regular RAM) after a few epochs, which I suspect come from the dataloader. It also does not explain why the file_descriptor strategy works for some models and not others.
I’m running out of ideas and open to any suggestions. Let me know if more details are needed; I could try to provide a minimal (non-)working example.
- OS: CentOS 7
- Python: 3.8.8
- PyTorch: 1.8.0
- GPUs: 4× RTX 2080
- CUDA: 11.0
- Driver: 450.66
- RAM: 125 GB