I am trying to run torchtune on two nodes using the following command:
tune run --rdzv-endpoint "${M_ADDR}:1234" --nnodes 2 --nproc-per-node 8 --rdzv-id 102 --rdzv-backend=c10d lora_finetune_distributed --config recipes/configs/llama3/70B_full.yaml
However, the run fails with torch.distributed.elastic.rendezvous.api.RendezvousTimeoutError.
Can someone confirm whether the command is correct? Also, are there any suggestions on how to resolve this issue, or any tests I can run to help narrow it down?
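One test that might help narrow this down is a plain TCP reachability check from the second node to the rendezvous endpoint. The sketch below is only an assumption about where the problem could be (network or firewall), not a confirmed diagnosis. It uses only the Python standard library; the helper name `can_connect` is hypothetical, and it assumes `M_ADDR` is exported in the environment and that port 1234 matches the `--rdzv-endpoint` above.

```python
import os
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and opens a TCP connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Assumption: M_ADDR holds the first node's address, as in the command above.
    host = os.environ.get("M_ADDR", "127.0.0.1")
    port = 1234  # assumption: same port as in --rdzv-endpoint
    print(f"{host}:{port} reachable: {can_connect(host, port)}")
```

Something has to be listening on the port for this to succeed, so it would presumably need to be run on the second node while the first node has already started `tune run` and is waiting in rendezvous. If it prints False in that situation, the cause may be a firewall, a wrong address in `M_ADDR`, or a blocked port rather than the command itself.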