I’m hitting a bug with torch.distributed.isend and irecv: I suspect a race condition, but I’m not sure how to debug it.
I’m using the MPI backend, so the buffers have to stay untouched until the send/recv completes.
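For concreteness, here is a minimal sketch of the pattern I’m trying to follow. It uses the gloo backend on CPU so it runs without mpirun (that substitution, plus the helper names `_worker`/`run_demo` and the port, are mine; my real code uses MPI with CUDA tensors):

```python
import os
import multiprocessing as mp

import torch
import torch.distributed as dist


def _worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29517"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    if rank == 0:
        send_buf = torch.arange(4, dtype=torch.float32)
        req = dist.isend(send_buf, dst=1)
        # The buffer must stay referenced and unmodified here ...
        req.wait()          # ... until wait() confirms completion.
        send_buf.add_(1.0)  # only now is it safe to reuse the buffer
    else:
        recv_buf = torch.empty(4, dtype=torch.float32)
        req = dist.irecv(recv_buf, src=0)
        req.wait()  # recv_buf contents are undefined before this returns
        assert torch.equal(recv_buf, torch.arange(4, dtype=torch.float32))
    dist.destroy_process_group()


def run_demo() -> bool:
    # fork start method (POSIX only): children inherit state and
    # do not re-import this module.
    ctx = mp.get_context("fork")
    procs = [ctx.Process(target=_worker, args=(rank, 2)) for rank in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return all(p.exitcode == 0 for p in procs)


if __name__ == "__main__":
    print(run_demo())
```

This version never corrupts data for me, which is what makes the MPI/CUDA behavior below so confusing.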
I see sudden “spikes” in the error, and they appear only beyond a certain degree of parallelism.
I wonder whether the PyTorch/Python garbage collector reclaims my buffers while a transfer is still in flight.
I saved them in a list, just in case.
Where can I check this in the code? Can I “guard” the buffers somehow?
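To check whether keeping the list actually pins the tensors, I ran this small experiment (`tensor_survives` is my own helper name; note it only shows the Python-side object lifetime — whether the MPI layer itself holds a reference to the underlying storage is exactly what I can’t tell):

```python
import gc
import weakref

import torch


def tensor_survives(keep_in_list: bool) -> bool:
    """Return True if the tensor is still alive after dropping the local name."""
    t = torch.ones(1024)
    guard = [t] if keep_in_list else []  # the "just in case" list
    ref = weakref.ref(t)                 # weak ref does not keep t alive
    del t                                # drop the only strong local name
    gc.collect()                         # collect any reference cycles too
    return ref() is not None             # alive only if `guard` still holds it


print(tensor_survives(keep_in_list=True))   # True: the list pins the tensor
print(tensor_survives(keep_in_list=False))  # False: nothing else keeps it alive
```

So the list does keep the Python objects alive, yet the corruption persists — which suggests the problem is not plain garbage collection.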
I see the PyTorch tests barely cover isend/irecv, and I want to verify that the bug is not internal…
I looked at the code and did not see anything suspicious.
However, the data corruption is still there.
I found that if I call torch.cuda.synchronize(device) explicitly before the isends, the problem is mitigated and could be mistaken for “solved”, but I don’t like this workaround at all.
I don’t see any rational reason why this should be necessary; there is probably a bug somewhere.
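For now my mitigation looks roughly like this (a sketch; `guarded_isend` is my own wrapper name, and the explanation in the comment is my guess, not something I’ve confirmed):

```python
import torch
import torch.distributed as dist


def guarded_isend(tensor, dst):
    """Wrap dist.isend, syncing pending CUDA work on the tensor's device first."""
    if tensor.is_cuda:
        # My guess: isend returns as soon as the host-side call is issued, so
        # MPI could read the buffer while the async CUDA kernel that fills it
        # is still running. Synchronizing rules that ordering problem out.
        torch.cuda.synchronize(tensor.device)
    req = dist.isend(tensor, dst=dst)
    # Caller must keep both `tensor` and `req` alive, and call req.wait()
    # before touching the buffer again.
    return req
```

A device-wide synchronize is a heavy hammer, though — it stalls every stream, which is why I don’t want to keep it.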
Reading the warnings here and here makes me believe that the MPI/distributed API probably does not do much of the bookkeeping necessary for sharing tensors, such as handling reference counts or guarding buffers with mutexes.
I use CUDA-aware OpenMPI; I thought that was supported.