My question is whether PyTorch supports peer-to-peer training without the need to initialise a process group, since there may be situations where the number of processes is not known a priori or can change throughout the training process. I saw the `P2POp` class, but it still seems to require that a group be initialised beforehand.
From what I understand, TorchElastic fails an entire sub-group if a single node in it fails, which is still undesired behaviour for my use case.
I am looking for low-level functionality that can establish a connection between two nodes, send/receive tensors through it, and allow training to continue even if some nodes disconnect from the group.
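To make concrete the kind of primitive I mean, here is a rough sketch using plain sockets and length-prefixed `pickle` framing (for real tensors one could serialize with `torch.save`/`torch.load` instead). This is not existing PyTorch API, just an illustration of the desired behaviour: a point-to-point channel between two nodes where a peer disconnect raises a catchable error instead of tearing down a whole group.

```python
import pickle
import socket

def send_obj(sock: socket.socket, obj) -> None:
    """Send one length-prefixed pickled object over a connected socket."""
    data = pickle.dumps(obj)
    sock.sendall(len(data).to_bytes(8, "big"))  # 8-byte big-endian length header
    sock.sendall(data)

def _recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, raising ConnectionError if the peer goes away."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            # A clean way to detect disconnects: the caller can catch this
            # and keep training with the remaining peers.
            raise ConnectionError("peer disconnected")
        buf += chunk
    return buf

def recv_obj(sock: socket.socket):
    """Receive one length-prefixed pickled object."""
    length = int.from_bytes(_recv_exact(sock, 8), "big")
    return pickle.loads(_recv_exact(sock, length))

if __name__ == "__main__":
    # Demo with a local connected socket pair standing in for two nodes.
    a, b = socket.socketpair()
    send_obj(a, {"step": 1, "grad": [0.1, 0.2]})
    print(recv_obj(b))
```

The point is that each pairwise connection is independent, so losing one peer only affects the channels it participates in, rather than invalidating a globally initialised world.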