However, when I print from each process, I see that local_rank is set to -1 in every one of them.
How do I get different, unique values in the --local_rank argument? I thought launch.py was handling that?
Hi, I wasn’t able to reproduce your issue with torch-1.10.
Here’s the test script I used:
# save this as test.py
import argparse
import sys

from torch.distributed.elastic.multiprocessing.errors import record


def parse_args(argv):
    parser = argparse.ArgumentParser(description="test script")
    parser.add_argument("--local_rank", type=int)
    return parser.parse_args(argv)


@record
def main():
    args = parse_args(sys.argv[1:])
    print(f"local_rank={args.local_rank}")


if __name__ == "__main__":
    main()
Then run:
$ python -m torch.distributed.launch --nnodes=1 --node_rank=0 --nproc_per_node=4 test.py
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
local_rank=3
local_rank=2
local_rank=0
local_rank=1
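Since torchrun sets --use_env by default, the deprecation warning above suggests reading the local rank from the environment instead of from a command-line argument. A minimal sketch of that (the default of 0 for non-launched runs is my own choice, not something the launcher mandates):

```python
import os


def get_local_rank() -> int:
    # torchrun (and launch.py with --use_env) sets LOCAL_RANK for each
    # worker process; fall back to 0 when running without a launcher.
    return int(os.environ.get("LOCAL_RANK", 0))


if __name__ == "__main__":
    print(f"local_rank={get_local_rank()}")
```

With this, the script no longer needs a --local_rank argument at all, so it works the same under both torchrun and plain python.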
That line is there to make sure that error trace information from the trainer (a different process from the agent process) can be propagated to the agent for error summary and reporting purposes. See: Error Propagation — PyTorch 1.10.0 documentation
If @record is not there, no trace information will be logged in the error summary table, so you’ll have to dig through the logs yourself to find the exception stack trace.
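To illustrate the mechanism, here is a hypothetical, simplified sketch of what a record-style decorator does: catch the worker's exception, dump the traceback to the error file named by TORCHELASTIC_ERROR_FILE (the env var the elastic agent sets for each worker), then re-raise. The real @record in torch.distributed.elastic records more (timestamps, PID, failing rank) and uses its own file format; the helper name record_errors below is mine.

```python
import functools
import json
import os
import traceback


def record_errors(fn):
    """Hypothetical sketch of an @record-style error-propagation decorator."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception:
            # Write the traceback where the agent process can find it,
            # then re-raise so the worker still exits with an error.
            error_file = os.environ.get("TORCHELASTIC_ERROR_FILE")
            if error_file:
                with open(error_file, "w") as f:
                    json.dump({"message": traceback.format_exc()}, f)
            raise
    return wrapper
```

This is why the decorator must sit on the entrypoint of the worker process: anything raised above it never reaches the try/except, and the agent sees no trace.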