SLURM srun vs torchrun: Different numbers of spawned processes

Hi,

For a setup of two compute nodes with 4 GPUs each, I cannot reconcile the difference in the number of spawned processes between a direct launch with srun and a delegated launch via torchrun.

  1. Direct launch with SLURM’s srun
import os

import torch
import torch.distributed as dist

# SLURM exports these for every task started by srun
WORLD_SIZE = int(os.environ['SLURM_NTASKS'])
WORLD_RANK = int(os.environ['SLURM_PROCID'])
LOCAL_RANK = int(os.environ['SLURM_LOCALID'])

# the default env:// rendezvous assumes MASTER_ADDR/MASTER_PORT are exported in the job script
dist.init_process_group('nccl', rank=WORLD_RANK, world_size=WORLD_SIZE)

device = torch.device("cuda:{}".format(LOCAL_RANK))

Here, 2 x 4 = 8 processes are required in total, i.e.

$ srun -n 8 python test.py
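
For reference, a minimal batch script for this direct launch could look like the sketch below (the #SBATCH values are assumptions for the 2 × 4 GPU layout above, and MASTER_ADDR/MASTER_PORT are exported here because the default env:// rendezvous of init_process_group needs them):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4   # one task per GPU -> 8 tasks in total
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=1

# rendezvous endpoint for approach (1): first node of the allocation
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500

srun python test.py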
  2. Delegated launch via torchrun

According to the manual, WORLD_SIZE, RANK, and LOCAL_RANK are automatically populated by torchrun.

import os

import torch
import torch.distributed as dist

# torchrun already exports RANK, WORLD_SIZE, and LOCAL_RANK (plus MASTER_ADDR/MASTER_PORT)
dist.init_process_group('nccl', init_method="env://")

device = torch.device("cuda:{}".format(os.environ['LOCAL_RANK']))

To launch torchrun via the SLURM scheduler, I need a wrapper script to assign --node_rank correctly, i.e.

$ srun -n 2 wrapper.sh 

With the content of wrapper.sh as follows:

#!/bin/bash
# use the first node of the allocation as the rendezvous host for the multi-node run
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
torchrun \
    --nnodes=$SLURM_NNODES \
    --node_rank=$SLURM_NODEID \
    --nproc_per_node=$SLURM_GPUS_ON_NODE \
    --master_addr=$MASTER_ADDR \
    test.py
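
For comparison, the batch-script sketch around this wrapper (again, the #SBATCH values are assumptions; the key difference is that only one task per node is requested, while all 4 GPUs per node remain visible to torchrun):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1   # one torchrun launcher per node -> 2 tasks in total
#SBATCH --gres=gpu:4          # all 4 GPUs stay visible to each launcher
#SBATCH --cpus-per-task=1     # only one core is bound to each launcher

# 2 launcher tasks; each torchrun then forks 4 worker processes
srun wrapper.sh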

Thus:

  • (1) spawns 8 SLURM tasks on 8 cores → 1 process per core
  • (2) spawns only 2 SLURM tasks, each of which forks 4 torchrun workers that share the launcher's cores → 4 processes per core (see the quick check below)
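
To make the difference visible, a rough sanity check is the following (it only prints each task's hostname and CPU binding, assuming the cluster binds tasks to cores; the torchrun workers inherit the binding of their launcher):

# variant (1): 8 tasks, each with its own core
srun -n 8 bash -c 'echo "task $SLURM_PROCID on $(hostname): $(grep Cpus_allowed_list /proc/self/status)"'

# variant (2): 2 launcher tasks; the 4 workers forked on each node share this binding
srun -n 2 bash -c 'echo "task $SLURM_PROCID on $(hostname): $(grep Cpus_allowed_list /proc/self/status)"'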

The outputs are the same, yet the allocated resources are different.
I am somewhat confused about the performance implications.
(1) is probably better due to the absence of CPU oversubscription.
In other words, there seems to be no point in using torchrun with a fixed resource allocation via SLURM.

I appreciate your insights on this matter.
Thanks.