Torchrun command not found, does it need to be installed separately?

I was following the torchrun tutorial, but at no point does it say how to install torchrun. Should it just be there automatically since I already have PyTorch, or what's going on?

Output:

(meta_learning_a100) [miranda9@hal-dgx ~]$ torchrun --nnodes=1 --nproc_per_node=2 ~/ultimate-utils/tutorials_for_myself/my_l2l/dist_maml_l2l_from_seba.py
bash: torchrun: command not found...
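
For context, this is what I plan to try next (my own guess, not from the tutorial: as far as I can tell torchrun ships with PyTorch 1.10+ as a console script for the torch.distributed.run module, so checking the installed version and invoking the module directly seemed like reasonable fallbacks):

(meta_learning_a100) [miranda9@hal-dgx ~]$ python -c "import torch; print(torch.__version__)"
(meta_learning_a100) [miranda9@hal-dgx ~]$ python -m torch.distributed.run --nnodes=1 --nproc_per_node=2 ~/ultimate-utils/tutorials_for_myself/my_l2l/dist_maml_l2l_from_seba.py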

I am asking this because the official PyTorch pages seem to suggest using it: torchrun (Elastic Launch) — PyTorch 1.10 documentation

e.g.

TORCHRUN (ELASTIC LAUNCH)
torchrun provides a superset of the functionality of torch.distributed.launch, with the following additional functionalities:

Worker failures are handled gracefully by restarting all workers.

Worker RANK and WORLD_SIZE are assigned automatically.

Number of nodes is allowed to change between minimum and maximum sizes (elasticity).

Transitioning from torch.distributed.launch to torchrun
torchrun supports the same arguments as torch.distributed.launch except for --use_env which is now deprecated. To migrate from torch.distributed.launch to torchrun follow these steps:

If your training script already reads local_rank from the LOCAL_RANK environment variable, then you simply need to omit the --use_env flag, e.g.:

torch.distributed.launch:

$ python -m torch.distributed.launch --use_env train_script.py

torchrun:

$ torchrun train_script.py
If your training script reads the local rank from a --local_rank command-line argument, change it to read from the LOCAL_RANK environment variable instead, as demonstrated by the following code snippet:

torch.distributed.launch:

import argparse
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int)
args = parser.parse_args()

local_rank = args.local_rank

torchrun:

import os
local_rank = int(os.environ["LOCAL_RANK"])
The aforementioned changes suffice to migrate from torch.distributed.launch to torchrun. To take advantage of new features such as elasticity, fault-tolerance, and error reporting of torchrun, please refer to:

Train script for more information on authoring training scripts that are torchrun compliant.

the rest of this page for more information on the features of torchrun.
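
Side note to myself: putting the two migration snippets above together, this is roughly the minimal script I would expect to be torchrun compliant (my own sketch, not from the docs; minimal_torchrun_check.py is a made-up name, and it only assumes the LOCAL_RANK/RANK/WORLD_SIZE environment variables described above plus the standard torch.distributed API):

# minimal_torchrun_check.py -- my own sketch, hypothetical file name
import os

import torch
import torch.distributed as dist


def main():
    # torchrun sets these per worker; no --local_rank argument is needed.
    local_rank = int(os.environ["LOCAL_RANK"])
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    # "gloo" so the sketch also runs on CPU-only machines; "nccl" would be the usual choice on GPUs.
    dist.init_process_group(backend="gloo")

    # Each worker contributes its rank; after all_reduce every worker holds the same sum.
    t = torch.tensor([float(rank)])
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    print(f"rank {rank}/{world_size} (local_rank {local_rank}): sum of ranks = {t.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()

I would launch it with the single-node form from the usage section just below.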

Usage
Single-node multi-worker

>>> torchrun
    --standalone
    --nnodes=1
    --nproc_per_node=$NUM_TRAINERS
    YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
Fault tolerant (fixed number of workers, no elasticity):

>>> torchrun
    --nnodes=$NUM_NODES
    --nproc_per_node=$NUM_TRAINERS
    --rdzv_id=$JOB_ID
    --rdzv_backend=c10d
    --rdzv_endpoint=$HOST_NODE_ADDR
    YOUR_TRAINING_SCRIPT.py (--arg1 ... train script args...)
HOST_NODE_ADDR, in form <host>[:<port>] (e.g. node1.example.com:29400), specifies the node and the port on which the C10d rendezvous backend should be instantiated and hosted. It can be any node in your training cluster, but ideally you should pick a node that has a high bandwidth.
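
So if/when torchrun is actually available on this machine, I am guessing my single-node command from above would map to the standalone form roughly like this (again my own reading of the usage section, not something the tutorial shows):

torchrun --standalone --nnodes=1 --nproc_per_node=2 ~/ultimate-utils/tutorials_for_myself/my_l2l/dist_maml_l2l_from_seba.py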

cross posted:

pytorch - Torchrun command not found for distributed training, does it need to be installed separetely? - Stack Overflow