I am new to this forum and need to use torchrun. My question is: how do I get torchrun to run on k8s?
When I try to invoke torchrun on a training script, I get this error:
[W socket.cpp:464] [c10d] The server socket cannot be initialized on [::]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
I tried setting the rdzv (rendezvous) socket IP without any luck…
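In case it helps with diagnosing this: errno 97 in the messages above is "Address family not supported by protocol", and the bracketed `[::]` looks like an IPv6 address, so my guess is that the pod has no IPv6 support. Below is a rough sketch I would use to check that from inside the container. It only uses Python's standard `socket` module, not anything from torchrun, and the helper name `socket_error` is something I made up for this check.

```python
import errno
import socket


def socket_error(family):
    """Return None if a TCP socket of this family can be created, else the errno."""
    try:
        with socket.socket(family, socket.SOCK_STREAM):
            return None
    except OSError as exc:
        return exc.errno


if __name__ == "__main__":
    for name, family in (("IPv4", socket.AF_INET), ("IPv6", socket.AF_INET6)):
        err = socket_error(family)
        if err is None:
            print(f"{name}: socket can be created")
        elif err == errno.EAFNOSUPPORT:
            print(f"{name}: address family not supported (errno {err})")
        else:
            print(f"{name}: failed with errno {err} ({errno.errorcode.get(err, 'unknown')})")
```

If IPv6 reports "address family not supported" while IPv4 works, that would at least match the two messages above. Both of them are prefixed with `[W`, so they may only be warnings rather than the actual failure.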
So is torchrun meant to work on k8s?
(BTW, I was trying to run the Mistral finetune example… It uses torchrun in the tutori
Thx!