Hi, I am wondering what happens when a connection breaks and a message is lost, for example on the connection from the master to a worker process: 1. Will torch handle the reconnect? 2. If the connection is somehow restored, will torch retry sending the lost message and let the job keep running, or does it just rely on TCP reliability and simply fail the job? Thanks!
- Currently torch doesn’t handle node failures well.
- We rely on TCP reliability for node connectivity, but we also have some robustness mechanisms such as RPC retries.
You can also look into TorchElastic (see the PyTorch/Elastic documentation) to handle node failures.
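
To make the idea of retrying on top of TCP concrete, here is a minimal application-level sketch. It is not PyTorch's internal RPC retry mechanism: `call_with_retries`, its parameters, and the choice of exception types are hypothetical names and assumptions made for illustration only. It retries a callable with exponential backoff and re-raises once the attempts are used up.

```python
import random
import time


def call_with_retries(fn, *args, max_attempts=3, base_delay=0.1,
                      retry_on=(ConnectionError, TimeoutError), **kwargs):
    """Call fn(*args, **kwargs), retrying on the given exception types.

    Hypothetical helper: the exception types in retry_on are an assumption
    and should be adjusted to whatever the wrapped call actually raises.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except retry_on:
            if attempt == max_attempts:
                # Out of attempts: surface the last error to the caller.
                raise
            # Exponential backoff (base, 2x base, 4x base, ...) plus jitter
            # so that many workers do not retry at the same instant.
            delay = base_delay * 2 ** (attempt - 1)
            time.sleep(delay + random.uniform(0, delay / 2))
```

You could wrap a call such as `torch.distributed.rpc.rpc_sync` in a helper like this, but check which exception a failed call raises in your setup before relying on the defaults. Retrying is only safe when the remote function is idempotent, because a message that looked lost may already have been executed on the worker.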