Torchtune distributed issue

I am trying to run torchtune on two nodes using the following command:

tune run --rdzv-endpoint "${M_ADDR}:1234" --nnodes 2 --nproc-per-node 8 --rdzv-id 102 --rdzv-backend=c10d lora_finetune_distributed --config recipes/configs/llama3/70B_full.yaml

However, I receive a torch.distributed.elastic.rendezvous.api.RendezvousTimeoutError.

Can someone confirm whether the command is correct? Also, any suggestions on how to resolve this issue? Are there any tests I can run to help narrow it down?

I’m not sure about torchtune specifically, but it looks like it uses torchrun under the hood: torchrun (Elastic Launch) — PyTorch 2.6 documentation

I’m guessing the M_ADDR you are using is not reachable from the other node. If you launch a simple script with plain torchrun, does that work?
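
As a sketch of what such a sanity check could look like (the script name all_reduce_test.py is just a placeholder, and the torchrun flags below mirror the ones from your tune run command), something like this would test whether the rendezvous itself works independently of torchtune:

# all_reduce_test.py - minimal multi-node rendezvous/NCCL sanity check
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for each worker,
    # so the default env:// init can be used here
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # each rank contributes its rank id; the all-reduce sum should be
    # 0 + 1 + ... + (world_size - 1) on every rank
    t = torch.tensor([dist.get_rank()], device="cuda", dtype=torch.float32)
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} -> all_reduce sum = {t.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Then run the same command on both nodes:

torchrun --nnodes 2 --nproc-per-node 8 --rdzv-id 102 --rdzv-backend c10d --rdzv-endpoint "${M_ADDR}:1234" all_reduce_test.py

If this also times out at rendezvous, the problem is likely network reachability (a firewall blocking port 1234, or M_ADDR resolving to an interface the second node cannot reach) rather than anything torchtune-specific.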