P2P CUDA-aware MPI problem

I’m trying to use Isend with CUDA-aware Open MPI.

I found that I need to explicitly call torch.cuda.synchronize(device) before every Isend, otherwise training collapses. I hit this problem even when I stash the sent tensor (so it keeps a live reference and therefore won’t be freed and overwritten).
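A minimal sketch of the workaround I’m describing, assuming the process group was initialized with the MPI backend (the function name and the stash list are my own illustrative names, not library API):

```python
import torch
import torch.distributed as dist

# Assumes the process group was initialized elsewhere with the MPI backend:
#   dist.init_process_group(backend="mpi")

def isend_with_sync(tensor, dst, stash):
    # Without this synchronize, the CUDA kernels that produced `tensor`
    # may still be running when MPI reads the device buffer, so the
    # receiver can observe stale data.
    torch.cuda.synchronize(tensor.device)
    req = dist.isend(tensor, dst=dst)
    # Keep a reference so the buffer isn't freed/overwritten before the
    # asynchronous send actually completes.
    stash.append((tensor, req))
    return req
```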

I have tried it with several different settings:

  1. with P2P-enabled GPUs (GTX 1080), and
  2. without P2P-enabled GPUs (RTX 2080 Ti); in the latter case the sends must go through the host.
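For reference, whether the two GPUs can reach each other via P2P can be checked with torch.cuda.can_device_access_peer; a quick sketch (the device indices 0 and 1 are assumptions):

```python
import torch

# Report whether device 0 can directly access device 1's memory (P2P).
# When this is False (as on the RTX 2080 Ti setup above), transfers are
# staged through host memory instead.
if torch.cuda.is_available() and torch.cuda.device_count() >= 2:
    print(torch.cuda.can_device_access_peer(0, 1))
else:
    print("fewer than two CUDA devices visible")
```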

I wonder what could be happening here?
(I am using a single thread with async operations.)