I’m trying to use `Isend` with CUDA-aware OpenMPI.
I found that I need to explicitly call `torch.cuda.synchronize(device)` before every `Isend`, otherwise the training error collapses. The problem occurs even when I stash the sent tensor (so it keeps a live reference and therefore won’t be freed and overwritten).
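Concretely, the workaround looks roughly like this (a minimal sketch; `isend_with_sync` is just a helper name I made up, and I’m assuming `torch.distributed` with the MPI backend here, since the same pattern applies to raw mpi4py `Isend` calls):

```python
import torch
import torch.distributed as dist

def isend_with_sync(tensor, dst, in_flight):
    """Hypothetical helper showing the workaround described above.

    Without the explicit synchronize, the Isend appears to read stale or
    partially written GPU memory and the training error collapses.
    """
    # The call I have to add before every Isend:
    torch.cuda.synchronize(tensor.device)
    handle = dist.isend(tensor, dst=dst)
    # Stash the tensor alongside the handle so it stays referenced
    # (and thus isn't freed/overwritten) until the send completes.
    in_flight.append((handle, tensor))
    return handle
```

Later I call `wait()` on the stashed handles before reusing or dropping the tensors.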
I have tried it with several different settings:
- with P2P enabled GPU (GTX1080)
- and without P2P-enabled GPUs (RTX 2080 Ti); in the latter case the sends must go through the host.
I wonder what could be happening here?
(I am using a single thread with async operations)