Distributed Training PyTorch Gloo [socket: Too many open files]

Hi,

I am trying to get distributed training working across multiple nodes using the Gloo backend. Everything works on a single node with 4 GPUs, but as soon as I go across nodes the processes hang and I get the following error:

Exception in thread Thread-46:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 476, in _reduction_thread_fn
    _process_batch()  # just to have a clear scope
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 469, in _process_batch
    dist.all_reduce(coalesced, group=group_id)
  File "/opt/conda/lib/python3.6/site-packages/torch/distributed/__init__.py", line 324, in all_reduce
    return torch._C._dist_all_reduce(tensor, op, group)
RuntimeError: [/opt/conda/conda-bld/pytorch_1524586445097/work/third_party/gloo/gloo/transport/tcp/pair.cc:142] socket: Too many open files

I have searched around and the only references I find are to issues with the data loader. In this example I am using synthetic data so that can’t be it. Any help would be much appreciated.
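
For reference, this is roughly how each process sets up the process group in my runs. It is only a minimal sketch: the master address, port, rank, and world size below are placeholder values, not my exact configuration.

```python
import os
import torch
import torch.distributed as dist

def init_distributed(rank, world_size):
    # Placeholder rendezvous settings; in my runs these come from the launcher.
    os.environ.setdefault("MASTER_ADDR", "10.0.0.1")  # hypothetical master node IP
    os.environ.setdefault("MASTER_PORT", "29500")     # hypothetical free port
    # Gloo backend over TCP; one process per GPU on each node.
    dist.init_process_group(backend="gloo",
                            init_method="env://",
                            rank=rank,
                            world_size=world_size)

if __name__ == "__main__":
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    init_distributed(rank, world_size)
    # Sanity check: sum a tensor across all processes (default op is SUM).
    t = torch.ones(1)
    dist.all_reduce(t)
    print("rank {}: all_reduce result = {}".format(rank, t.item()))
```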

Thanks

I also faced the same issue with the PyTorch Gloo backend. My code runs without this error after I raised the limit on open file descriptors (which covers sockets) on my Linux machine, e.g. with ulimit -n 65535; the default is usually 1024.
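
If you would rather raise the limit from inside the training script than in the shell (for example when you cannot change the launcher environment), a sketch along these lines should work; it has to run before init_process_group, and 65535 is just the same illustrative value as above.

```python
import resource

# Query the current soft/hard limits on open file descriptors (sockets count against this).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("current RLIMIT_NOFILE: soft={}, hard={}".format(soft, hard))

# Raise the soft limit, capped at the hard limit (or 65535 if the hard limit is unlimited).
target = 65535 if hard == resource.RLIM_INFINITY else min(65535, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
```

Note that an unprivileged process can only raise the soft limit up to the hard limit; if the hard limit itself is too low, it has to be raised system-wide (e.g. in /etc/security/limits.conf).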