I’ve been trying to follow this tutorial for multi-node computation using SLURM but I have not succeeded yet.
I’m trying to implement this on a university supercomputer that I log in to via ssh on port 22. When I set
MASTER_PORT=12340 (or some other number) in the SLURM script, I get no response, which I assume is because nothing is listening on that port. This may be a naive point, but I thought that maybe I have to set
MASTER_PORT to 22 instead. When I do this, I get a permission denied error when the code reaches the
dist.init_process_group() method, specifically:
Traceback (most recent call last):
  File "train_dist.py", line 262, in <module>
    main()
  File "train_dist.py", line 220, in main
    world_size=opt.world_size, rank=opt.rank)
  File "/home/miniconda3/envs/vit/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 595, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/miniconda3/envs/vit/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 232, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/home/miniconda3/envs/vit/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 161, in _create_c10d_store
    hostname, port, world_size, start_daemon, timeout, multi_tenant=True
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:22 (errno: 13 - Permission denied). The server socket has failed to bind to 0.0.0.0:22 (errno: 13 - Permission denied).
In this method, I have set
dist_url='env://', but I’m not sure if this is contributing to the problem.
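For reference, here is a minimal sketch of how I understand the env:// setup, in case I have misread something. As far as I know, env:// reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE from the environment, and on Linux ports below 1024 normally require root to bind, which would explain the errno 13 on port 22. The helper names can_listen and env_rendezvous_vars are my own and hypothetical, and mapping SLURM_PROCID and SLURM_NTASKS to RANK and WORLD_SIZE is an assumption about how the job is launched (one task per process via srun).

```python
import os
import socket


def can_listen(port, host="0.0.0.0"):
    """Return True if this process may bind a TCP socket to `port` on `host`."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Covers "permission denied" (errno 13, e.g. a privileged port
            # such as 22 for a non-root user) as well as "address in use".
            return False
    return True


def env_rendezvous_vars(master_addr, master_port, environ=os.environ):
    """Build the four variables that init_method='env://' reads.

    Assumption: one SLURM task per process, so SLURM_PROCID is the global
    rank and SLURM_NTASKS is the world size.
    """
    return {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        "RANK": environ.get("SLURM_PROCID", "0"),
        "WORLD_SIZE": environ.get("SLURM_NTASKS", "1"),
    }


if __name__ == "__main__":
    for port in (22, 12340):
        print(port, "bindable" if can_listen(port) else "not bindable")
    # 127.0.0.1 is a placeholder; on a real job this would be the hostname
    # of the node running rank 0.
    print(env_rendezvous_vars("127.0.0.1", 12340))
```

The idea is that the master port only has to be a free, unprivileged port on the rank 0 node that the other nodes can reach. It should not need to be related to the ssh port at all.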
What I have also tried is rerouting the port 22 traffic to some other port (e.g. 65000), but I get permission denied even for attempting this rerouting. I’m not sure what else I can try at this point. Does anyone have any suggestions?