This question concerns the fundamental design of the internals of PyTorch's distributed (torch.distributed) implementation.
There are two ways to synchronize a distributed Torch execution: using the Work handle that each ProcessGroup (PG) call returns, or the Future handle that can be derived from the Work one. Why is that? Why not rely solely on the Work handle (via work->wait())?
Moreover, in the provided examples, those based on the C++ frontend seem to prefer the former approach (work->wait()), while DDP in the Python frontend is built internally on the future concept. This forces anyone willing to implement a custom backend to support both, and the future implementation is quite intricate (also, why is it marked final?).
To me, these two serve the same purpose, which duplicates logic and creates confusion about when to use one over the other.
Is there a fundamental reason that led to requiring the Future approach in addition to the Work one, which seems to have been developed first (since all PG calls return a Work object)?
Work is the one we use the most, and Future is currently only used for the RPC feature.