Multi-node training with DistributedDataParallel: permission denied in the `dist.init_process_group()` call

I’ve been trying to follow this tutorial for multi-node computation using SLURM but I have not succeeded yet.

I’m trying to run this on a university supercomputer that I log into via SSH on port 22. When I set MASTER_PORT=12340 (or some other number) in the SLURM script, I get no response, so I assume nothing is happening on that port. This may be a naive thought, but I figured maybe I had to set MASTER_PORT to 22 instead. When I do that, I get a permission denied when the code reaches the dist.init_process_group() call, specifically:

Traceback (most recent call last):
  File "train_dist.py", line 262, in <module>
    main()
  File "train_dist.py", line 220, in main
    world_size=opt.world_size, rank=opt.rank)
  File "/home/miniconda3/envs/vit/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 595, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/miniconda3/envs/vit/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 232, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/home/miniconda3/envs/vit/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 161, in _create_c10d_store
    hostname, port, world_size, start_daemon, timeout, multi_tenant=True
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:22 (errno: 13 - Permission denied). The server socket has failed to bind to 0.0.0.0:22 (errno: 13 - Permission denied).

In this call I have world_size=4, rank=0, dist_backend='nccl', and dist_url='env://' set, but I’m not sure whether these are contributing to the problem.

I have also tried rerouting port 22 traffic to some other port (e.g. 65000), but I get permission denied even for attempting the rerouting. I’m not sure what else to try at this point; does anyone have any suggestions?
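For what it’s worth, the errno 13 in that traceback is exactly what Linux returns when a non-root process tries to bind a privileged port (anything below 1024), which port 22 is. A tiny sketch of that rule, with `check_master_port` as a hypothetical helper name:

```python
# Hypothetical helper illustrating why binding MASTER_PORT=22 fails:
# on Linux, ports below 1024 are privileged and only root may bind
# them, which is the errno 13 (Permission denied) in the traceback.
def check_master_port(port: int) -> None:
    if port < 1024:
        raise PermissionError(
            f"MASTER_PORT={port} is a privileged port; "
            "pick a free port in the 1024-65535 range"
        )

check_master_port(12340)  # fine: unprivileged port
```

So MASTER_PORT=22 can never work as a normal user, independent of anything PyTorch does.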


Hello, can you clarify what you mean by “you get no responses” from port 12340? Does it just hang?

I would suggest not using port 22: it is a privileged port (and already claimed by your SSH daemon), so your current choice of 12340 should be fine.
A couple of suggestions:

  1. Try adding print statements before and after dist.init_process_group(...), e.g. print(f"RANK {args.rank}: before init_process_group") and the same for after, to see which ranks actually call into it and which ones finish.
  2. Try a different backend by setting dist_backend='gloo' and see if there is any difference.
  3. Make sure there are no firewall rules blocking access to port 12340, or try a different (non-privileged) port.
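For point 3, one quick way to rule out a firewall silently dropping traffic is a plain TCP connect from a worker node to the master’s rendezvous port. A stdlib-only sketch (`port_reachable` is a hypothetical helper, not part of torch):

```python
import socket

# Hypothetical helper: run on a worker node to check whether the
# master's rendezvous port accepts TCP connections at all. A False
# result points at a firewall rule or at nothing listening yet.
def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Called as, say, port_reachable(os.environ["MASTER_ADDR"], 12340) on each non-master rank, after rank 0 has started.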

Hi, thanks for your reply.

So I ssh into this supercomputer I’m using for University research through port 22.

I schedule the below SLURM script:

#!/bin/bash
#SBATCH --time=00:15:00
#SBATCH -N 2
#SBATCH -C TitanX
#SBATCH --gres=gpu:1
#SBATCH -o myfile.out
#SBATCH --ntasks-per-node=1 # number of tasks per node (1 task per GPU in a single node)

# Load GPU drivers
module load cuda11.1/toolkit
module load cuDNN/cuda11.1

# change 5-digit MASTER_PORT as you wish, slurm will raise Error if duplicated with others
export MASTER_PORT=12340
# WORLD_SIZE as gpus/node * num_nodes
export WORLD_SIZE=4

### get the first node name as master address - customized for vgg slurm
### e.g. master(gnodee[2-5],gnoded1) == gnodee2
echo "NODELIST="${SLURM_NODELIST}
master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr
echo "MASTER_ADDR="$MASTER_ADDR

# This loads the anaconda virtual environment with our packages
source /home/jbo480/.bashrc
conda activate viton_37

echo "checkpoint"

# Run the actual experiment
python train_dist.py --name gmm_train --stage GMM --workers 4 --save_count 5000 --shuffle --data_list train_pairs.txt --keep_step 100000 --decay_step 100000

Now if I change MASTER_PORT to 22, the python command executes and I run into the init_process_group() error. If I set MASTER_PORT to 12340, everything seems to run fine up to the “checkpoint” echo, but then the Python script doesn’t appear to run at all, since I get no output from any of its print statements.
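One thing worth noting in the script above: with -N 2 and --ntasks-per-node=1 SLURM will launch only 2 tasks, while WORLD_SIZE is exported as 4, so ranks 2 and 3 never show up and the env:// rendezvous would hang waiting for them. A sketch that derives rank and world size from SLURM’s own per-task variables instead of hard-coding them (this assumes the Python process is launched through srun, which is what populates SLURM_PROCID and SLURM_NTASKS):

```python
import os

# Sketch: read rank and world size from the variables SLURM itself
# sets for each task, so they cannot drift out of sync with -N and
# --ntasks-per-node in the batch script.
def slurm_dist_env():
    rank = int(os.environ.get("SLURM_PROCID", "0"))
    world_size = int(os.environ.get("SLURM_NTASKS", "1"))
    return rank, world_size
```

These values can then be passed straight into init_process_group (or exported as RANK and WORLD_SIZE for the env:// method).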


Hi, just wanted to say that I have experienced the same issue. Did you find a solution yet? @mesllo

Since then, a lot has changed and I’ve given up on multi-node computing for now. There doesn’t seem to be any solid consensus or explanation of how it should work, and every answer I see seems to rely on hit-and-miss implementation. The PyTorch docs still don’t seem detailed enough for me to understand what I’m doing wrong, but maybe that’s just me.

I managed to make it work. It turns out that it does not matter what MASTER_PORT in your PyTorch script is; as far as I understand, you can set it to any free four- or five-digit port number. The environment-variable approach from your tutorial did not work for me, but a different approach from this example did. The main difference is that you call
dist.init_process_group(init_method=args.init_method, ...) where args.init_method = tcp://$MASTER_ADDR:3456.
Let me know if this works for you.
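For reference, a minimal sketch of that tcp:// approach (the actual torch call is shown in a comment; 3456 is just the port the linked example happens to use, and any free non-privileged port should do):

```python
import os

# Sketch of the tcp:// rendezvous described above: build the
# init_method URL from MASTER_ADDR rather than relying on the
# env:// handler reading MASTER_ADDR/MASTER_PORT itself.
master_addr = os.environ.get("MASTER_ADDR", "127.0.0.1")
init_method = f"tcp://{master_addr}:3456"

# The training script then passes this string in, e.g.:
#   import torch.distributed as dist
#   dist.init_process_group(backend="nccl", init_method=init_method,
#                           world_size=world_size, rank=rank)
print(init_method)
```

Rank 0 binds and listens on that address, so it must be run on the node that MASTER_ADDR resolves to.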
