Hi,
For a setup of two compute nodes with 4 GPUs each, I could not reconcile the difference in the number of spawned processes between a direct launch with srun and a delegated launch via torchrun.
- Direct launch with SLURM’s srun
import os
import torch
import torch.distributed as dist

# SLURM exposes the global task count, global rank, and per-node rank
WORLD_SIZE = int(os.environ['SLURM_NTASKS'])
WORLD_RANK = int(os.environ['SLURM_PROCID'])
LOCAL_RANK = int(os.environ['SLURM_LOCALID'])

dist.init_process_group('nccl', rank=WORLD_RANK, world_size=WORLD_SIZE)
device = torch.device("cuda:{}".format(LOCAL_RANK))
Here, 2 x 4 = 8 processes are required in total, i.e.
$ srun -n 8 python test.py
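Side note: even though rank and world_size are passed explicitly, init_process_group still defaults to the env:// rendezvous, so MASTER_ADDR and MASTER_PORT have to be available before srun. A minimal sketch of exporting them in the batch script (deriving the master from the node list via scontrol, and the port 29500, are assumptions):
# test.py above relies on the default env:// rendezvous, so the master location
# must be exported before launching (first node as master; port 29500 is an assumption)
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500
srun -n 8 python test.py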
- Delegated launch via torchrun
According to the manual, WORLD_SIZE, RANK, and LOCAL_RANK are automatically populated by torchrun.
import os
import torch
import torch.distributed as dist

dist.init_process_group('nccl', init_method="env://")  # RANK/WORLD_SIZE/MASTER_* come from torchrun
device = torch.device("cuda:{}".format(os.environ['LOCAL_RANK']))
To launch torchrun under the SLURM scheduler, I need a wrapper to assign --node_rank correctly, i.e.
$ srun -n 2 wrapper.sh
with the content of wrapper.sh as follows:
#!/bin/bash
# one torchrun per node; it spawns one worker per GPU on that node
torchrun \
    --nnodes=$SLURM_NNODES \
    --node_rank=$SLURM_NODEID \
    --nproc_per_node=$SLURM_GPUS_ON_NODE \
    test.py
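For the two torchrun instances to find each other, a common rendezvous endpoint is also needed (the default master address is localhost, which only works on a single node). A minimal sketch, assuming the first node of the allocation acts as master on port 29500:
# same wrapper, but with an explicit rendezvous so the two torchrun instances agree on a master
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
torchrun \
    --nnodes=$SLURM_NNODES \
    --node_rank=$SLURM_NODEID \
    --nproc_per_node=$SLURM_GPUS_ON_NODE \
    --master_addr="$MASTER_ADDR" \
    --master_port=29500 \
    test.py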
Thus:
- (1) spawns 8 tasks on 8 allocated cores → 1 process per core
- (2) spawns 8 worker processes on only 2 allocated cores (1 CPU per SLURM task by default) → 4 processes per core (see the quick check below)
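A quick way to inspect what each launch actually gets from SLURM (the echoed variables are standard SLURM ones; the output format is just for illustration):
# direct launch: 8 tasks, each with its own core
srun -n 8 bash -c 'echo "$(hostname) rank=$SLURM_PROCID local=$SLURM_LOCALID cpus=$SLURM_CPUS_ON_NODE"'
# delegated launch: 2 tasks only, so the 4 workers per node share whatever CPUs that task got
srun -n 2 bash -c 'echo "$(hostname) task=$SLURM_PROCID cpus=$SLURM_CPUS_ON_NODE"'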
The outputs are the same, yet the allocated resources differ.
I am somewhat confused about the performance implications.
(1) is probably better due to the absence of CPU oversubscription.
In other words, there seems to be no point in using torchrun
with a fixed resource allocation via SLURM.
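That said, the oversubscription in (2) seems to come from the allocation rather than from torchrun itself; giving each torchrun task as many CPUs as it spawns workers would avoid it. A sketch, with flag values assumed for this 2 × 4-GPU case:
# one task per node with 4 CPUs each, so every torchrun worker can get its own core
srun --nodes=2 --ntasks-per-node=1 --cpus-per-task=4 wrapper.sh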
I appreciate your insights on this matter.
Thanks.