Hi,
For certain reasons, I need to launch the training from another node. I hope it can work like this:
On node1:
torchrun --nnodes 2 --nproc_per_node gpu --node_rank 0 --master_addr ip_of_node3 --master_port port_of_node3 train.py
On node2:
torchrun --nnodes 2 --nproc_per_node gpu --node_rank 1 --master_addr ip_of_node3 --master_port port_of_node3 train.py
On node3:
torchrun_or_other_script start.py
That is, only node1 and node2 have GPUs, training happens only on node1 and node2, and node3 is used only to start the training on node1 and node2.
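For context, train.py is nothing special, roughly a standard DistributedDataParallel script that reads the rendezvous information from the environment variables torchrun sets. A minimal sketch (the real model and training loop are omitted):

import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT for every worker
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    # ... build the model, wrap it in DistributedDataParallel, run the training loop ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()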
I want to do this because I do not know the actual IP addresses of node1 and node2, but node3 can be reached from both node1 and node2. Can I do this with PyTorch?