Race condition in Isend

seliad · December 26, 2019, 12:08pm

I'm hitting a bug with torch.distributed.isend and irecv. I suspect a race condition, but I'm not sure how to debug it.
I’m using MPI, so the buffers have to stay untouched until the send/recv is over.
I see crazy “spikes” in the error, which appear only at a certain degree of parallelism.

I wonder if the PyTorch/Python garbage collector touches my buffers.
I saved them in a list, just in case.
Where can I check this in the code? Can I “guard” the buffers somehow?
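
On the question of guarding the buffers, one common pattern is to store each buffer together with the request object returned by the non-blocking call, and to drop both only after wait() has returned. The sketch below is a rough illustration, not PyTorch API: `PendingOps`, `launch`, `outstanding` and `wait_all` are hypothetical names. It assumes `op` behaves like torch.distributed.isend or irecv, meaning it takes the buffer as its first argument and returns an object with a wait() method. Holding a reference keeps the garbage collector away from the buffer, but it does not stop your own code from writing to it, so in-place updates still have to come after wait_all().

```python
class PendingOps:
    """Hypothetical helper: keeps buffers alive until their async ops finish."""

    def __init__(self):
        # Each entry is a (buffer, handle) pair. The strong reference to the
        # buffer prevents it from being garbage collected while in flight.
        self._pending = []

    def launch(self, op, buffer, *args, **kwargs):
        # `op` is assumed to look like torch.distributed.isend / irecv:
        # it receives the buffer first and returns a handle with .wait().
        handle = op(buffer, *args, **kwargs)
        self._pending.append((buffer, handle))
        return handle

    def outstanding(self):
        # Number of operations that have been launched but not yet waited on.
        return len(self._pending)

    def wait_all(self):
        # Wait for every outstanding operation. The local list keeps the
        # buffers referenced until all waits in this call have returned.
        pending, self._pending = self._pending, []
        for _buffer, handle in pending:
            handle.wait()
        return len(pending)


# Usage sketch (assumes an initialized process group and a tensor `t`):
#   import torch.distributed as dist
#   ops = PendingOps()
#   ops.launch(dist.isend, t, dst=1)
#   ...            # do not modify `t` here
#   ops.wait_all() # only now is it safe to reuse or free `t`
```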

I see that the PyTorch tests barely cover isend/irecv, and I want to verify that the bug is not internal…

mrshenli · December 26, 2019, 4:06pm

Hi @seliad

The code for MPI-based torch.distributed.isend is here: https://github.com/pytorch/pytorch/blob/cc16819028c325e2543d45752a875bd3c5e09b32/torch/lib/c10d/ProcessGroupMPI.cpp#L591

seliad · December 28, 2019, 6:13pm

I looked at the code and did not see anything suspicious.
However, the data corruption is still there.

I found that if I call torch.distributed.synchronize(device) explicitly before the isends, the problem is mitigated and could be mistaken for “solved”, but I don’t like this solution at all.
I don’t see any rational reason why it should be needed, so there is probably a bug somewhere.
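
A minimal sketch of the workaround described above, assuming the call meant is torch.cuda.synchronize(device), which is the device-wide synchronization call PyTorch provides. `synced_isend` is a hypothetical helper name. One possible explanation for why it helps, offered as an assumption rather than a confirmed diagnosis: CUDA kernels run asynchronously with respect to the host, and a CUDA-aware MPI library is generally not aware of the CUDA stream that is still writing the tensor, so the send may read the buffer before its contents are final.

```python
try:
    import torch
    import torch.distributed as dist
except ImportError:  # lets this sketch be imported where PyTorch is absent
    torch = dist = None


def synced_isend(tensor, dst, tag=0):
    """Hypothetical helper: finish pending GPU work, then start the send."""
    if tensor.is_cuda:
        # Block the host until all kernels queued on this device have
        # finished, so the buffer holds its final values before the MPI
        # library starts reading it.
        torch.cuda.synchronize(tensor.device)
    # isend returns a request object; call .wait() on it before reusing
    # or freeing `tensor`.
    return dist.isend(tensor, dst=dst, tag=tag)
```

If this is indeed the cause, a narrower synchronization (for example on the current stream only) might be enough, but that is untested here.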

Reading the warnings here and here makes me believe that the MPI/distributed API probably does not do many of the things needed for sharing tensors,
such as handling reference counts, guarding state with mutexes, etc.
I use CUDA-aware Open MPI, which I thought was supported.