Unsure how to migrate my containerized torch.distributed.launch script to torchrun

I run PyTorch from a container on an HPC system. A simplified version of my usual launch script looks something like this:

RANK=1
# launch on every host except HOST0 over ssh; rank 0 runs locally below
for i in $HOSTS
do
        if [[ "$i" != "$HOST0" ]] ; then
                ssh $i $CONTAINER_EXEC $CONTAINER_EXEC_ARGS python -m torch.distributed.launch --nproc_per_node=$NGPU --nnodes=$NHOST --node_rank=$RANK --master_addr="$HOST0" --master_port=12581 train.py $CONFIG &
                RANK=$((RANK+1))
        fi
done

$CONTAINER_EXEC $CONTAINER_EXEC_ARGS python -m torch.distributed.launch --nproc_per_node=$NGPU --nnodes=$NHOST --node_rank=0 --master_addr="$HOST0" --master_port=12581 train.py $CONFIG

This works great for me, because it means I don’t have to play games trying to keep a version of mpirun or gloo or whatever on the host system synchronized with those same tools in the container.

According to the distributed docs, torch.distributed.launch is about to be deprecated in favor of torchrun, which uses Elastic Launch instead of the old static launch mechanism. That model doesn’t obviously fit my workflow, and there doesn’t seem to be a non-deprecated way to replicate this style of launching with torch.distributed.launch. Is my old workflow about to stop being supported?
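For reference, my best guess at a one-for-one translation (same variables as above; I’m not certain the static --node_rank/--master_addr path is itself staying supported) would be:

RANK=1
for i in $HOSTS
do
        if [[ "$i" != "$HOST0" ]] ; then
                # Same static rendezvous as before, just swapping the launcher in.
                # One difference I do know of: torchrun no longer passes --local_rank
                # to train.py, so the script has to read the LOCAL_RANK env var instead.
                ssh $i $CONTAINER_EXEC $CONTAINER_EXEC_ARGS torchrun --nproc_per_node=$NGPU --nnodes=$NHOST --node_rank=$RANK --master_addr="$HOST0" --master_port=12581 train.py $CONFIG &
                RANK=$((RANK+1))
        fi
done

$CONTAINER_EXEC $CONTAINER_EXEC_ARGS torchrun --nproc_per_node=$NGPU --nnodes=$NHOST --node_rank=0 --master_addr="$HOST0" --master_port=12581 train.py $CONFIG

The docs seem to steer toward the c10d rendezvous instead, which as I understand it means running the same command on every node with no --node_rank, e.g. torchrun --nnodes=$NHOST --nproc_per_node=$NGPU --rdzv_backend=c10d --rdzv_endpoint=$HOST0:12581 --rdzv_id=$JOB_ID train.py $CONFIG (where $JOB_ID is just some unique id for the run), but I don’t know whether that is the only non-deprecated route.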

Maybe Tristan can help here? @d4l3k