Free async request object

Hi,
I noticed that each async torch.distributed request object holds a pointer to the sent tensor, so the buffer memory is not freed in the sender process until we explicitly call wait() or is_completed().
I am looking for a way to overcome this. Any suggestions?

Looks like only the work objects from the Gloo and MPI ProcessGroups hold those tensors. Does it work if you do not hold the async work/request object in the application? Both the Gloo and MPI ProcessGroups have a queue and a runLoop that keep those work objects alive until they are processed.

Curious, if you don’t call wait(), how do you know when the communication has finished and it is safe to consume the output?

If we destroy async objects before completion we get:
“Attempted destruction of AsyncWork before work has completed, terminating the program.”

Therefore I must keep all the sent request objects in the application.

(Answering your question: I don’t consume the output of these (isend) messages.)
(Of course, the receiver calls wait() before consuming.)
I’m using MPI (cuda-aware openmpi), with async p2p messages.

The goal is simple

  1. send tons of isend messages,
  2. get memory freed automatically on each completion
  3. finally, wait() on all the isends at the end of the program.
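
In code, something like this minimal sketch (placeholder tensor sizes, tags, and destination rank; it assumes dist.init_process_group("mpi") has already been called):

import torch
import torch.distributed as dist

# Minimal sketch of the pattern above (placeholder sizes, tags and ranks).
requests = []
for i in range(1000):
    t = torch.randn(1024, 1024)        # buffer to ship out
    req = dist.isend(t, dst=1, tag=i)  # returns an async work/request object
    requests.append(req)               # each request keeps its tensor alive
    # ... other work; ideally t would be freed as soon as its isend completes ...

# only at the very end do we block on all outstanding sends
for req in requests:
    req.wait()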

I see. Only the send/recv/recvAnysource APIs return AsyncWork, and the AsyncWork is not stored in the queue. Other collectives use a different WorkEntry data structure, which is stored in the queue. It looks to me like we should consolidate these APIs to implement the same behavior.

@teng-li @pietern Any reason for having different async work data structures for MPI ProcessGroup?

Created an issue to track this. In the meantime, does it work for you if you put the async work into a queue and launch a separate thread to wait and dequeue? The wait API does release the GIL, so it shouldn’t block the main thread. This will lead to behavior similar to what we would get by consolidating send/recv with the collective async work. It won’t be perfect, as it does not guarantee that a tensor is freed immediately when its communication is done if an earlier send/recv finishes later. A better solution would need to install callbacks into the MPI thread, which requires a larger revamp, and I am not sure if MPI supports that.
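
A rough sketch of that queue + cleaner-thread idea (the None sentinel and shutdown handling here are just one possible arrangement):

import queue
import threading
import torch.distributed as dist

work_queue = queue.Queue()

def cleaner():
    while True:
        req = work_queue.get()
        if req is None:   # sentinel: no more work is coming
            break
        req.wait()        # releases the GIL while waiting
        # dropping the reference here lets the sent tensor be freed

cleaner_thread = threading.Thread(target=cleaner, daemon=True)
cleaner_thread.start()

# sender side: enqueue each request right after issuing the isend, e.g.
#   req = dist.isend(tensor, dst=1)
#   work_queue.put(req)
# and at shutdown:
#   work_queue.put(None)
#   cleaner_thread.join()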


Only the send/recv/recvAnysource APIs return AsyncWork, and the AsyncWork is not stored in the queue

You mean isend too, right?

Oh, sorry, I meant the C++ send/recv/recvAnysource APIs. The isend API is Python only. Both send and isend call into the same C++ send API; the only difference is whether it waits on the work.
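
In Python terms, roughly (a simplified illustration of that point, not the actual dispatch code, and assuming an initialized process group):

import torch
import torch.distributed as dist

# Both calls go through the same C++ send; the blocking variant just waits.
t = torch.ones(16)
req = dist.isend(t, dst=1)  # returns immediately with an async work handle
req.wait()                  # roughly what dist.send(t, dst=1) does for you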

@mrshenli I tried the solution with the cleaner thread and it doesn’t work:
it seems like the wait() in the cleaner thread blocks the whole process.
I think this is where it happens in the code.
I could only make it work with

while not r.is_completed():  # busy-wait until the request reports completion
    pass

but performance suffered a lot (~2x slowdown) compared to my previous solution.

Hi, I want to follow up on this thread. What is the status of this feature? What is the best practice for freeing the requests now?

@seliad I wonder what you chose to implement eventually?

I see the issue @mrshenli opened is still open.

I ended up adding some Python code to handle freeing the buffers. Just keep the requests and occasionally check for completion and clean up, roughly as sketched below.
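
A minimal sketch of that approach (the helper names here are just illustrative):

import torch.distributed as dist

pending = []  # outstanding isend requests

def isend_tracked(tensor, dst, tag=0):
    # issue the isend and remember the request so it can be swept later
    req = dist.isend(tensor, dst=dst, tag=tag)
    pending.append(req)
    return req

def sweep_completed():
    # drop references to finished requests; their send buffers become freeable
    pending[:] = [r for r in pending if not r.is_completed()]

# call sweep_completed() every few iterations, and wait() on whatever is
# still in `pending` at the very end of the program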

cc @agolynski for MPI and c10d