Are torch TCP connections between processes reliable?

mmsz · March 4, 2021, 7:57am

Hi, I am wondering what happens when a connection breaks and a message is lost, for example on the connection from the master to a worker process. 1. Will torch handle the reconnect? 2. If the connection is somehow restored, will torch retry sending the lost message and let the job keep running, or does it just rely on TCP reliability and simply fail the job? Thanks!

agolynski · March 9, 2021, 7:00pm

Hi!

Currently torch doesn’t handle node failures well.
We rely on TCP reliability for node connectivity, but we also have some robustness mechanisms, such as RPC retries.
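
To illustrate the general idea behind retries, here is a minimal application-level sketch in plain Python. It is not torch's internal retry mechanism: `call_with_retries`, its parameters, and the choice of exceptions to retry on are all hypothetical and only show the pattern of retrying a call with exponential backoff.

```python
import time


def call_with_retries(fn, *args, max_attempts=3, base_delay=0.1,
                      retry_on=(ConnectionError, TimeoutError),
                      sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs), retrying on the given exception types.

    Hypothetical helper for illustration only; not part of torch.
    The delay doubles after each failed attempt (exponential backoff).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except retry_on:
            # Out of attempts: re-raise the last error to the caller.
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

A wrapper like this is only safe around calls that are idempotent, since a request whose reply was lost may already have run on the remote side. Which exception types a given RPC call raises on a broken connection is an assumption here, so adjust `retry_on` to what you actually observe.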

You can also look into TorchElastic (see the PyTorch/Elastic master documentation) to handle node failures.