Launch.py tool doesn't set local_rank properly

Hi,

I’m trying to launch a train.py DDP script on a 4-GPU machine.
I’m using the launch.py tool described here (this experience is quite clunky, by the way; I wish there were a clean PyTorch class for this!), which is supposed to set local_rank in each process: “--local_rank: This is passed in via launch.py”, as the documentation says.

python /home/ec2-user/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/distributed/launch.py \
    --nnodes=1 \
    --node_rank=0 \
    --nproc_per_node=4 \
    train.py \
    --gpu-count 4 \
    --dataset . \
    --cache tmp \
    --height 604 \
    --width 960 \
    --checkpoint-dir . \
    --batch 10 \
    --workers 24 \
    --log-freq 20 \
    --prefetch 2 \
    --bucket $bucket \
    --eval-size 10 \
    --iterations 20 \
    --class-list a2d2_images/camera_lidar_semantic/class_list.json

However, in each of my processes local_rank is -1 (the default value). What am I doing wrong? How do I get a distinct local_rank in each process?
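For context, launch.py is supposed to append `--local_rank=<N>` to each spawned process’s command line, so the script itself has to declare and parse that argument. Below is a simplified sketch of what I understand the parsing in train.py should look like (hypothetical names; the real script has many more arguments):

```python
import argparse

def parse_args(argv=None):
    # Sketch only: launch.py appends --local_rank=<N> to the command line
    # of every worker it spawns, so the script must declare the flag.
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=-1,
                        help="set automatically by torch.distributed.launch")
    parser.add_argument("--gpu-count", type=int, default=1)
    # parse_known_args tolerates the many other flags train.py accepts
    args, _unknown = parser.parse_known_args(argv)
    return args

if __name__ == "__main__":
    args = parse_args()
    print(f"local_rank = {args.local_rank}")
```

With this in place, each of the 4 processes should see a distinct local_rank in 0..3 rather than the -1 default, but in my runs it stays at -1.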

cc @Kiuk_Chung @aivanou