Torchrun not utilizing all of the nodes Slurm allocates?

To keep things simple, I wanted to use this PyTorch echo.py script to get torchrun going. The only real change I made is that I added some Slurm environment variables to track, giving the following:

#!/usr/bin/env python3
import io
import os
import pprint
import sys
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")

if __name__ == "__main__":

    env_dict = {
        k: os.environ[k]
        for k in (
            "LOCAL_RANK",
            "RANK",
            "GROUP_RANK",
            "WORLD_SIZE",
            "MASTER_ADDR",
            "MASTER_PORT",
            "SLURMD_NODENAME",
            "SLURM_PROCID",
            "SLURM_NODEID",
            "SLURM_JOB_NODELIST",
        )
        if k in os.environ  # skip variables that a given launcher doesn't set
    }

    with io.StringIO() as buff:
        print("======================================================", file=buff)
        print(
            f"Environment variables set by the agent on PID {os.getpid()}:", file=buff
        )
        pprint.pprint(env_dict, stream=buff)
        print("======================================================", file=buff)
        print(buff.getvalue())
        sys.stdout.flush()

    dist.barrier()

    print(
        (
            f"On PID {os.getpid()}, after init process group, "
            f"rank={dist.get_rank()}, world_size = {dist.get_world_size()}\n"
        )
    )
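
For context, outside of Slurm I would expect to be able to run this on a single node with something along the lines of the command below (the 4 is just my per-node GPU count):

# single-node sanity check, no Slurm involved (assumes 4 GPUs on the node)
torchrun --standalone --nproc-per-node=4 echo.py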

I have tried several Slurm batch scripts, including the following:

#!/bin/bash

#Submit this script with: sbatch filename
#SBATCH --time=0:20:00   # walltime
#SBATCH --nodes=2   # number of nodes
#SBATCH --ntasks-per-node=4   # number of tasks per node
#SBATCH --job-name=gpt2   # job name
#SBATCH --mem=0

# Training setup
# so processes know who to talk to
MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
MASTER_PORT=6000
WORLD_SIZE=8

echo $SLURM_PROCID

srun torchrun --nnodes=2 --nproc-per-node=4 --max-restarts=3 --rdzv-id=$SLURM_JOB_ID --rdzv-backend=c10d --rdzv-endpoint=$MASTER_ADDR --node-rank $SLURM_PROCID  echo.py
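
Since that torchrun line relies on $SLURM_PROCID being different for each srun task, one sanity check is to print what each task actually sees, e.g. with something like:

# print the Slurm task variables from inside each srun task; the single quotes make
# the expansion happen per task rather than in the batch shell
srun bash -c 'echo "host=$(hostname) SLURM_PROCID=$SLURM_PROCID SLURM_NODEID=$SLURM_NODEID SLURM_LOCALID=$SLURM_LOCALID"'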

I get the following error, which suggests that torchrun is mapping everything onto only one of the two nodes:

======================================================
Environment variables set by the agent on PID 94025:
{'GROUP_RANK': '1',
 'LOCAL_RANK': '3',
 'MASTER_ADDR': 'nid001480',
 'MASTER_PORT': '36327',
 'RANK': '7',
 'SLURMD_NODENAME': 'nid001480',
 'SLURM_JOB_NODELIST': 'nid[001480,001489]',
 'SLURM_NODEID': '0',
 'SLURM_PROCID': '3',
 'WORLD_SIZE': '8'}
======================================================
======================================================
Environment variables set by the agent on PID 94022:
{'GROUP_RANK': '0',
 'LOCAL_RANK': '1',
 'MASTER_ADDR': 'nid001480',
 'MASTER_PORT': '36327',
 'RANK': '1',
 'SLURMD_NODENAME': 'nid001480',
 'SLURM_JOB_NODELIST': 'nid[001480,001489]',
 'SLURM_NODEID': '0',
 'SLURM_PROCID': '2',
 'WORLD_SIZE': '8'}
======================================================


[rank1]: Traceback (most recent call last):
[rank1]:   File "/users/jsmidt/AI/Huggingface/Torchrun/test1/test14/echo.py", line 41, in <module>
[rank1]:     dist.barrier()
[rank1]:   File "/users/jsmidt/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank1]:     return func(*args, **kwargs)
[rank1]:   File "/users/jsmidt/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3683, in barrier
[rank1]:     work = default_pg.barrier(opts=opts)
[rank1]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.20.5
[rank1]: Last error:
[rank1]: Duplicate GPU detected : rank 1 and rank 5 both on CUDA device 41000
[rank2]: Traceback (most recent call last):
[rank2]:   File "/users/jsmidt/AI/Huggingface/Torchrun/test1/test14/echo.py", line 41, in <module>
...

This is not the full output, but it is enough to see the error: multiple processes are being mapped to the same GPU. And if you look at the environment-variable blocks like the ones shown above, every process landed on the same master node even though Slurm allocated two nodes.
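
The error message suggests rerunning with NCCL_DEBUG=WARN, which as far as I know just means exporting it in the batch script before the srun line:

# ask NCCL to print more detail on the next run
export NCCL_DEBUG=WARN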

To make sure it isn't just an issue with the cluster, I also tried running the same script with a vanilla srun command:

#!/bin/bash

#Submit this script with: sbatch filename
#SBATCH --time=0:20:00   # walltime
#SBATCH --nodes=2   # number of nodes
#SBATCH --ntasks-per-node=4   # number of tasks per node
#SBATCH --job-name=gpt2   # job name
#SBATCH --qos=standard   # qos name
#SBATCH --mem=0


echo "NODELIST="${SLURM_NODELIST}
export MASTER_ADDR=$(scontrol show hostname ${SLURM_NODELIST} | head -n 1)
export MASTER_PORT=7000
export WORLD_SIZE=$(($SLURM_NNODES * $SLURM_NTASKS_PER_NODE))


srun python echo.py
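
(For completeness: torchrun isn't in the picture here, so the torchrun-specific variables like LOCAL_RANK, RANK, and GROUP_RANK are not set, which is why they don't show up in the output below. If I wanted to supply them from Slurm myself, my understanding is it would look roughly like the line below, where the CUDA_VISIBLE_DEVICES pinning is just a common convention, not something echo.py requires.)

# per-task mapping from Slurm variables; single quotes so expansion happens inside each task
srun bash -c 'RANK=$SLURM_PROCID LOCAL_RANK=$SLURM_LOCALID CUDA_VISIBLE_DEVICES=$SLURM_LOCALID python echo.py'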

And here we see both nodes are being utilized, for example:

======================================================
Environment variables set by the agent on PID 27883:
{'MASTER_ADDR': 'nid001489',
 'MASTER_PORT': '7000',
 'SLURMD_NODENAME': 'nid001492',
 'SLURM_JOB_NODELIST': 'nid[001489,001492]',
 'SLURM_NODEID': '1',
 'SLURM_PROCID': '7',
 'WORLD_SIZE': '8'}
======================================================

======================================================
Environment variables set by the agent on PID 27882:
{'MASTER_ADDR': 'nid001489',
 'MASTER_PORT': '7000',
 'SLURMD_NODENAME': 'nid001492',
 'SLURM_JOB_NODELIST': 'nid[001489,001492]',
 'SLURM_NODEID': '1',
 'SLURM_PROCID': '6',
 'WORLD_SIZE': '8'}
======================================================

and there are no further error messages.

Anyone have any idea why torchrun would not utilize both of the Slurm-allocated nodes? Thanks!