DDP launch.py: how can different processes receive different local ranks?

Hi, I’m using this doc to launch a DDP script: examples/README.md at master · pytorch/examples · GitHub

My launch code is:

 python /home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/distributed/launch.py \
    --nnodes 1 \
    --node_rank=0 \
    --nproc_per_node 4 \
    train.py \
    --gpu-count 4 \
    --dataset . \
    --cache tmp \
    --height 604 \
    --width 960 \
    --checkpoint-dir . \
    --batch 10 \
    --workers 24 \
    --log-freq 20 \
    --prefetch 2 \
    --bucket $bucket \
    --eval-size 10 \
    --iterations 20 \
    --class-list a2d2_images/camera_lidar_semantic/class_list.json

However, when I print the arguments in each process, I see that local_rank is set to -1 on every process.
How do I get different, unique values in the local_rank argument? I thought launch.py was handling that.

cc @aivanou, @Kiuk_Chung


Hi, I wasn’t able to repro your issue with torch-1.10.

Here’s the script I tested with:

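# save this as test.py

import argparse
import os
import sys

from torch.distributed.elastic.multiprocessing.errors import record


def parse_args(argv):
    parser = argparse.ArgumentParser(description="test script")
    # torch.distributed.launch passes --local_rank unless --use_env is set.
    parser.add_argument("--local_rank", type=int)
    return parser.parse_args(argv)


@record  # propagate this process's exception trace to the launcher's error summary
def main():
    args = parse_args(sys.argv[1:])
    # Print both sources of the local rank so they can be compared per process.
    print(f"--local_rank={args.local_rank}, LOCAL_RANK={os.environ.get('LOCAL_RANK')}")


if __name__ == "__main__":
    main()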

Then run

$ python -m torch.distributed.launch --nnode=1 --node_rank=0 --nproc_per_node=4 test.py
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

The example seems out of date. Please follow the instructions here: torchrun (Elastic Launch) — PyTorch 1.10.0 documentation
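As the launcher note above suggests, the script can read the local rank from the LOCAL_RANK environment variable rather than relying on the --local_rank argument. Below is a minimal sketch of that idea, not the code from train.py: the helper name get_local_rank is hypothetical, and the -1 default is only an assumption about how the original script declares its argument. It prefers the environment variable and falls back to the command-line value.

```python
import argparse
import os


def get_local_rank(argv=None, environ=None):
    """Return the local rank, preferring the LOCAL_RANK environment variable.

    get_local_rank is a hypothetical helper for illustration only.
    """
    environ = os.environ if environ is None else environ

    parser = argparse.ArgumentParser(description="local rank demo")
    # Assumption: the training script declares --local_rank with a default of -1.
    parser.add_argument("--local_rank", type=int, default=-1)
    # parse_known_args ignores the script's other arguments (e.g. --batch).
    args, _unknown = parser.parse_known_args(argv)

    if "LOCAL_RANK" in environ:
        # Set per process by the launcher when it uses environment variables.
        return int(environ["LOCAL_RANK"])
    # Fall back to the command-line value (or the default if nothing was passed).
    return args.local_rank
```

Calling get_local_rank() with no arguments inside a launched process reads sys.argv and os.environ. If it still returns the default there, neither the environment variable nor the argument reached the process.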

Thanks a lot, I’ll try that!
So you’re using

python -m torch.distributed.launch

Isn’t the current guidance to use torchrun?

What is the record line (from torch.distributed.elastic.multiprocessing.errors import record) doing?

That line is there to make sure the error trace info on the trainer (a different process from the agent process) can be propagated to the agent for error summary and reporting purposes. See: Error Propagation — PyTorch 1.10.0 documentation

If the @record is not there then no trace information will be logged in the error summary table. So you’ll have to dig through the logs for the exception stack trace.
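To illustrate the idea behind this kind of error propagation, here is a simplified stand-in written in plain Python. It is not PyTorch's implementation: the decorator name record_to_file, the JSON layout, and the explicit error_file parameter are all hypothetical. It only shows the general pattern of a worker writing its exception trace somewhere a parent process can later read it, then re-raising.

```python
import functools
import json
import traceback


def record_to_file(error_file):
    """Hypothetical stand-in for the idea behind @record (not the PyTorch code).

    On an exception, write the message and stack trace to error_file so that
    another process can read them later, then re-raise the exception.
    """
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception as exc:
                payload = {
                    "message": repr(exc),
                    "traceback": traceback.format_exc(),
                }
                with open(error_file, "w") as f:
                    json.dump(payload, f)
                raise  # the worker still fails; the trace is now also on disk
        return wrapper
    return decorator
```

Without such a wrapper, the parent process would only see the worker's exit code, which matches the point above about having to dig through the logs for the stack trace.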
