Hi,
For a setup of two compute nodes with 4 GPUs each, I could not reconcile the difference in the number of spawned processes between a direct launch with srun and a delegated launch via torchrun.
- Direct launch with SLURM’s srun
import os
import torch
import torch.distributed as dist

# SLURM exposes the global task count, global rank, and per-node rank
WORLD_SIZE = int(os.environ['SLURM_NTASKS'])
WORLD_RANK = int(os.environ['SLURM_PROCID'])
LOCAL_RANK = int(os.environ['SLURM_LOCALID'])

dist.init_process_group('nccl', rank=WORLD_RANK, world_size=WORLD_SIZE)
device = torch.device("cuda:{}".format(LOCAL_RANK))
Here, 2 x 4 = 8 processes are required in total, i.e.
$ srun -n 8 python test.py
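Side note: even though rank and world_size are passed explicitly, init_process_group still defaults to the env:// rendezvous, so MASTER_ADDR and MASTER_PORT have to be available before srun. A minimal sketch of exporting them in the batch script (deriving the master from the node list via scontrol, and the port 29500, are assumptions):
# test.py above relies on the default env:// rendezvous, so the master location
# must be exported before launching (first node as master; port 29500 is an assumption)
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500
srun -n 8 python test.py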
- Delegated launch via torchrun
According to the manual, WORLD_SIZE, RANK, and LOCAL_RANK are automatically populated by torchrun.
import os
import torch
import torch.distributed as dist

dist.init_process_group('nccl', init_method="env://")  # RANK/WORLD_SIZE/MASTER_* come from torchrun
device = torch.device("cuda:{}".format(os.environ['LOCAL_RANK']))
To launch torchrun under the SLURM scheduler, I need a wrapper to assign --node_rank correctly, i.e.
$ srun -n 2 wrapper.sh
with the content of wrapper.sh as follows:
#!/bin/bash
# one torchrun per node; it spawns one worker per GPU on that node
torchrun \
    --nnodes=$SLURM_NNODES \
    --node_rank=$SLURM_NODEID \
    --nproc_per_node=$SLURM_GPUS_ON_NODE \
    test.py
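For the two torchrun instances to find each other, a common rendezvous endpoint is also needed (the default master address is localhost, which only works on a single node). A minimal sketch, assuming the first node of the allocation acts as master on port 29500:
# same wrapper, but with an explicit rendezvous so the two torchrun instances agree on a master
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
torchrun \
    --nnodes=$SLURM_NNODES \
    --node_rank=$SLURM_NODEID \
    --nproc_per_node=$SLURM_GPUS_ON_NODE \
    --master_addr="$MASTER_ADDR" \
    --master_port=29500 \
    test.py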
Thus:
- (1) spawns 8 tasks on 8 allocated cores → 1 process per core
- (2) spawns 8 worker processes on only 2 allocated cores (1 CPU per SLURM task by default) → 4 processes per core (see the quick check below)
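A quick way to inspect what each launch actually gets from SLURM (the echoed variables are standard SLURM ones; the output format is just for illustration):
# direct launch: 8 tasks, each with its own core
srun -n 8 bash -c 'echo "$(hostname) rank=$SLURM_PROCID local=$SLURM_LOCALID cpus=$SLURM_CPUS_ON_NODE"'
# delegated launch: 2 tasks only, so the 4 workers per node share whatever CPUs that task got
srun -n 2 bash -c 'echo "$(hostname) task=$SLURM_PROCID cpus=$SLURM_CPUS_ON_NODE"'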
The outputs are the same, yet the allocated resources differ.
I am somewhat confused about the performance implications.
(1) is probably better due to the absence of CPU oversubscription.
In other words, there seems to be no point in using torchrun
with a fixed resource allocation via SLURM.
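That said, the oversubscription in (2) seems to come from the allocation rather than from torchrun itself; giving each torchrun task as many CPUs as it spawns workers would avoid it. A sketch, with flag values assumed for this 2 × 4-GPU case:
# one task per node with 4 CPUs each, so every torchrun worker can get its own core
srun --nodes=2 --ntasks-per-node=1 --cpus-per-task=4 wrapper.sh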
I appreciate your insights on this matter.
Thanks.