Distributed training on a SLURM cluster

Sorry for the naive question, but I am confused about how distributed training integrates with a SLURM cluster. Do we need to explicitly call torch.distributed.launch when invoking the Python script, or is this taken care of automatically?

In other words, is this script correct?

#!/bin/bash
#SBATCH -p <dummy_name>
#SBATCH --time=12:00:00
#SBATCH --nodes=1
#SBATCH --gres=gpu:Tesla-V100-32GB:4
#SBATCH --cpus-per-task=2
#SBATCH --mem=60G
#SBATCH --job-name=knee_eval_ad_ax0
#SBATCH --output=slurm.out

eval "$(conda shell.bash hook)"
conda activate pytorch

python -m torch.distributed.launch --nproc_per_node=4 main.py 

Your help would be highly appreciated.

Thanks,
Chinmay


The script looks fine, but you might want to replace the launch command with torchrun, as the former is (or will be) deprecated (see the sketch below). Are you able to check the GPU utilization on this node, and could you check whether all devices are used?
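
A minimal sketch of the torchrun replacement for your single-node, 4-GPU case (--standalone just avoids having to configure a rendezvous endpoint):

# drop-in replacement for the deprecated torch.distributed.launch command
torchrun --standalone --nproc_per_node=4 main.py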


Thank you so much for your reply. I do not have permission to log in to individual nodes, so I cannot see the utilization. However, I can see that the per-node batch size is being divided by the number of GPUs (4).

That’s a good point. Additionally, you could add a few print statements to make sure the data is pushed to all devices and that the corresponding parameters are on the same device.
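
For example, something along these lines inside the training loop (just a sketch, assuming model is the DDP-wrapped module and data is the current input batch):

import torch.distributed as dist

# check that each rank sees its own batch and that inputs and parameters live on the same device
rank = dist.get_rank()
print(f"rank {rank}: batch on {data.device}, params on {next(model.parameters()).device}")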


Just to make sure I understood correctly: it is not sufficient to have the SLURM parameters or torchrun on their own; we need to provide both for things to work.

I’m not a SLURM expert, and I think it might be possible to let SLURM handle the distributed run somehow.
However, I use SLURM to set up the node and let PyTorch handle the actual DDP launch (which also seems to be your use case). Let’s wait and see if some SLURM experts can give you more ideas.


Hi @ptrblck, can you please share more details on how you use SLURM to set up the node?

There is an excellent tutorial on distributed training with PyTorch under SLURM from Princeton, here. This is my job submission script, using containers via Singularity:

#!/bin/bash

#SBATCH --job-name=COOL_JOB_NAME    # create a short name for your job
#SBATCH --nodes=20
#SBATCH --ntasks-per-node=4      # total number of tasks per node
#SBATCH --gres=gpu:4            # number of gpus per node
#SBATCH --cpus-per-task=4        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem=240GB               # total memory per node 
#SBATCH --time=23:59:00          # total run time limit (HH:MM:SS)




##### Number of total processes 
echo "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX "
echo "Nodelist:= " $SLURM_JOB_NODELIST
echo "Number of nodes:= " $SLURM_JOB_NUM_NODES
echo "Ntasks per node:= "  $SLURM_NTASKS_PER_NODE
echo "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX "

# If you want to load things from your .bashrc profile, e.g. cuda drivers, singularity etc 
source ~/.bashrc


# ******************* These are read internally it seems ***********************************
# ******** Master port, address and world size MUST be passed as variables for DDP to work 
export MASTER_PORT=$(expr 10000 + $(echo -n $SLURM_JOBID | tail -c 4))
export WORLD_SIZE=$(($SLURM_NNODES * $SLURM_NTASKS_PER_NODE))
echo "MASTER_PORT"=$MASTER_PORT
echo "WORLD_SIZE="$WORLD_SIZE

master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr
echo "MASTER_ADDR="$MASTER_ADDR
# ******************************************************************************************

# zoom zoom - recommended from lightning
export NCCL_NSOCKS_PERTHREAD=4
export NCCL_SOCKET_NTHREADS=2
export NCCL_MIN_CHANNELS=32


echo "Run started at:- "
date

# Actual run of script 
#srun python main.py # Use this if you have python in your environment
srun singularity exec --nv /Location/Of/Your/Containers/pytorch_22.05-py3.sif python main.py
echo ""
echo "################################################################"
echo "@@@@@@@@@@@@@@@@@@ Run completed at:- @@@@@@@@@@@@@@@@@@@@@@@@@"
date
echo "@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@"
echo "################################################################"


Hi folks, I’ve been fighting with this for the past couple of days and I’d like to get some advice.

So the initial question is about running torchrun while SLURM schedules the job (i.e. from within an sbatch script). Here is an old example of this working: https://github.com/pytorch/examples/blob/main/distributed/ddp-tutorial-series/slurm/sbatch_run.sh#L17-L23. Please note this requires modifying the rendezvous endpoint etc.; I think setting these using SLURM env vars is key, so torchrun knows what resources it has. I was able to run this on a single node, but with two nodes my job would just hang at the “rendezvous’ing worker group” stage, so it wouldn’t even initialize. The code example in the link didn’t work for me, but it is rather old.
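
For reference, by “setting these using SLURM env vars” I mean roughly the following (a sketch adapted from that example rather than my exact script; the port number and --nproc_per_node value are placeholders):

# pick the first allocated node as the rendezvous host
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun torchrun \
    --nnodes=$SLURM_JOB_NUM_NODES \
    --nproc_per_node=4 \
    --rdzv_id=$SLURM_JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint="$head_node:29500" \
    main.py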

For more context, I am able to run multi-node PyTorch SLURM-scheduled jobs without torchrun (as the previous excellent comment suggested), but this isn’t ideal as it would require more code modification. However, I do need to go through a container layer, so those commands are exactly what I needed. The example linked above doesn’t go through a container such as Singularity, so perhaps that is why it doesn’t work with torchrun in my setup. Thoughts?

Any help with setting up the rendezvous args, or a better explanation than the docs, would be greatly appreciated! I’ll provide code snippets/outputs once I hear from someone and am back at my computer. :)


This is the latest submission script I use that works (it handles the rendezvous point etc. with SLURM env vars). You will probably need to modify it to load .bashrc environment variables or modules for your HPC environment.

#!/bin/bash
#SBATCH --job-name=InPntLRE-4      # create a short name for your job
#SBATCH --nodes=3              # node count
#SBATCH --gres=gpu:4            # number of gpus per node
#SBATCH --ntasks-per-node=1      # total number of tasks per node
#SBATCH --cpus-per-task=32        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem=250G               # total memory per node (4 GB per cpu-core is default)
#SBATCH --time=23:59:00          # total run time limit (HH:MM:SS)

# zoom zoom 
export NCCL_NSOCKS_PERTHREAD=4
export NCCL_SOCKET_NTHREADS=2
export NCCL_MIN_CHANNELS=32

export RDZV_HOST=$(hostname)
export RDZV_PORT=29400

echo "Running on host:: $RDZV_HOST"
module list

srun apptainer exec --nv /path/to/container/container.sif torchrun \
    --nnodes=$SLURM_JOB_NUM_NODES \
    --nproc_per_node=4 \
    --rdzv_id=$SLURM_JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint="$RDZV_HOST:$RDZV_PORT" \
    main.py

I am facing an issue: when I set --ntasks-per-node=2 and --gres=gpu:2, the job freezes at distributed init.

With --ntasks-per-node=1 it works, but it is slow. Any clue what I am missing?

Let’s say you have a system with 3 nodes and 2 GPUs per node.

Set (in SLURM) ntasks-per-node=1 and gres=gpu:2. In this case, all GPUs are allocated to a single task per node (as far as SLURM is concerned), so torchrun spawns as many processes per node as there are GPUs:

srun torchrun --nproc_per_node=2 --nnodes=3 … # rest of params

The last command (the one using srun) is equivalent to you manually running the torchrun command on each separate node. Basically, torchrun handles the inter-GPU communication and SLURM spins up as many torchrun commands as there are nodes.

The above works for me, and performance is excellent (I use it for my research very frequently).

I have experimented in the past with ntasks-per-node = number of GPUs, but in that case you need to set nproc_per_node=1 in the torchrun command to make it work (take this with a grain of salt, because I last ran it a few months back when experimenting). My understanding is that in this case torchrun is launched nnodes * ntasks-per-node times via SLURM. I don’t remember what performance I got with this setup, and it is not what I use, but it should be feasible.
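
For completeness, that alternative would look roughly like this (again, a sketch from memory rather than something I currently run; adjust cpus-per-task etc. to your system):

#!/bin/bash
#SBATCH --nodes=3                # node count
#SBATCH --ntasks-per-node=2      # one task per GPU
#SBATCH --gres=gpu:2             # number of gpus per node
#SBATCH --cpus-per-task=8        # adjust to your system

export RDZV_HOST=$(hostname)
export RDZV_PORT=29400

# slurm launches nnodes * ntasks-per-node copies of torchrun,
# each of which manages a single process/GPU
srun torchrun \
    --nnodes=$(($SLURM_JOB_NUM_NODES * $SLURM_NTASKS_PER_NODE)) \
    --nproc_per_node=1 \
    --rdzv_id=$SLURM_JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint="$RDZV_HOST:$RDZV_PORT" \
    main.py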

If the first option I gave runs slowly on your system, you might have a problem with your software environment. I use containers (NGC based) and they work fine; not all software environments / installations we tested did, and many subtle differences can mess things up. I recommend containers + Singularity (Apptainer) as a proof of concept, and then moving on to module load pytorch/ or a local software installation on your system (your sysadmins might need to redo the module installation, or you may have to play around with different versions of a local PyTorch installation, etc.).

Hope the above helps.

Hey, thank you for the detailed reply. I am trying to run VideoMAE as a baseline with multiple GPUs on a single node with a 4-GPU setup.
When I set --ntasks-per-node=1 as shown below:

#!/bin/bash
#SBATCH --cpus-per-task=12      # CPU cores per task
#SBATCH --gres=gpu:3             # Number of allocated GPUs per node
#SBATCH --ntasks-per-node=1             # Number of tasks (processes)
#SBATCH --nodes=1                # Number of nodes
#SBATCH --job-name=videomaecoche     # Job name
#SBATCH --output=videomae_%j.log # Standard output and error log

srun torchrun --nproc_per_node=3 \
    --nnodes=1 \
    --node_rank=0 \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    run_mae_pretraining.py \
    --data_path ${DATA_PATH} \
    --mask_type tube \
    --mask_ratio 0.9 \
    --model pretrain_videomae_base_patch16_224 \
    --decoder_depth 4 \
    --batch_size 36 \
    --num_frames 16 \
    --sampling_rate 4 \
    --opt adamw \
    --opt_betas 0.9 0.95 \
    --warmup_epochs 40 \
    --save_ckpt_freq 20 \
    --epochs 801 \
    --log_dir ${OUTPUT_DIR} \
    --output_dir ${OUTPUT_DIR}

The model runs without any problem. However, because the authors of VideoMAE initialise world_size from SLURM_NTASKS, my world size becomes 1, as shown below.

args.rank = int(os.environ['SLURM_PROCID'])
args.gpu = int(os.environ['SLURM_LOCALID'])
args.world_size = int(os.environ['SLURM_NTASKS'])
os.environ['RANK'] = str(args.rank)
os.environ['LOCAL_RANK'] = str(args.gpu)
os.environ['WORLD_SIZE'] = str(args.world_size)

Because world_size becomes 1, the DistributedSampler feeds the model replicas on the different GPUs the same data (since num_replicas == num_tasks), and I think sampler_rank is also affected:

num_tasks = utils.get_world_size()
global_rank = utils.get_rank()
sampler_rank = global_rank

total_batch_size = args.batch_size * num_tasks
num_training_steps_per_epoch = len(dataset_train) // total_batch_size

sampler_train = torch.utils.data.DistributedSampler(
    dataset_train, num_replicas=num_tasks, rank=sampler_rank, shuffle=True
)

Kindly guide me as I am new to distributed training and feel like I am missing something.


When I set --ntasks-per-node=3, the world size becomes 3 as expected, but the code gets stuck, as mentioned in the original post.
@Foivos_Diakogiannis

The Python scripts you are using are set up to be launched with python main.py ..., not torchrun. With torchrun, the following environment variables are set automatically (see this): RANK, LOCAL_RANK, WORLD_SIZE.

Change this:

args.rank = int(os.environ['SLURM_PROCID'])       # <=== Problem here
args.gpu = int(os.environ['SLURM_LOCALID'])       # <=== Problem here
args.world_size = int(os.environ['SLURM_NTASKS']) # <=== Problem here
os.environ['RANK'] = str(args.rank)
os.environ['LOCAL_RANK'] = str(args.gpu)
os.environ['WORLD_SIZE'] = str(args.world_size)

to this:

args.rank = int(os.environ['RANK'])
args.gpu = int(os.environ['LOCAL_RANK'])
args.world_size = int(os.environ['WORLD_SIZE'])
os.environ['RANK'] = str(args.rank)
os.environ['LOCAL_RANK'] = str(args.gpu)
os.environ['WORLD_SIZE'] = str(args.world_size)

This is because torchrun, rather than SLURM, spawns the worker processes on each node and sets these variables for them.
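
If it helps, those variables are typically consumed downstream with something like the following (a sketch, not the exact VideoMAE code):

import os
import torch
import torch.distributed as dist

# torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every process it spawns
rank = int(os.environ['RANK'])
local_rank = int(os.environ['LOCAL_RANK'])
world_size = int(os.environ['WORLD_SIZE'])

torch.cuda.set_device(local_rank)
dist.init_process_group(backend='nccl', rank=rank, world_size=world_size)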

A nice tutorial for learning distributed training WITHOUT torchrun, using SLURM, is the Princeton one that I already linked in a previous post.

Regards,
F.