std::system_error during distributed training with gloo

Training resnet101 for 19k steps (3 nodes), I got “Unexpected poll revent: 25 on socket: 9: Software caused connection abort”. What can I do to fix this?

thanks

I am facing the same issue with 2-node training on a different model: “Unexpected poll revent: 25 on socket: 90: Software caused connection abort”. I am using the torch.distributed.launch utility.
Has anyone been able to resolve it, or does anyone have an idea what might be causing it?
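
For reference, the revent value that appears in both messages can be decoded into its poll flags. This is only a minimal sketch, assuming the Linux `poll(2)` bit values (other platforms may differ); the helper name `decode_revents` is made up for illustration and is not part of gloo or PyTorch.

```python
# Bit values from Linux <poll.h> (assumption: the nodes run Linux).
POLL_FLAGS = {
    0x001: "POLLIN",
    0x002: "POLLPRI",
    0x004: "POLLOUT",
    0x008: "POLLERR",
    0x010: "POLLHUP",
    0x020: "POLLNVAL",
}


def decode_revents(revents):
    """Return the names of the poll flags that are set in revents."""
    return [name for bit, name in POLL_FLAGS.items() if revents & bit]


# 25 is the value reported in both error messages above.
print(decode_revents(25))
```

Under that assumption, 25 is POLLIN, POLLERR and POLLHUP combined, which suggests the socket saw an error and a hang-up, i.e. the connection to the peer went away. It does not by itself say why the peer disconnected.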