Work vs. Future sync primitives for Distributed Torch backends

This question is about the fundamental design of the internals of the Distributed Torch implementation.
To synchronize a distributed Torch execution there are two mechanisms: the Work handle that every process group (PG) call returns, and the Future handle that can be derived from that Work object.
Why is that? Why not rely only on the Work handle (via work->wait())?
Moreover, the provided examples based on the C++ frontend seem to prefer this approach (work->wait()), while DDP in the Python frontend is built internally on the Future concept; this forces anyone implementing a custom backend to support both, and the Future implementation is quite intricate (also, why is it marked final?).
To me, these two serve the same purpose, which duplicates logic and creates confusion about when to use one over the other.
Is there a fundamental reason that led to requiring the Future approach on top of the Work one, which seems to have been developed first (since all PG calls return a Work object)?
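
For concreteness, here is a minimal sketch of the two paths I mean, using the public Python API with the Gloo backend (the launch command and script layout are just illustrative):

```python
# Minimal sketch of the two synchronization paths; assumes two processes
# launched with e.g. `torchrun --nproc_per_node=2 sketch.py`.
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")
tensor = torch.ones(4)

# Path 1: block on the Work handle returned by the asynchronous collective.
work = dist.all_reduce(tensor, async_op=True)
work.wait()  # returns once the allreduce has completed

# Path 2: derive a Future from the Work handle and chain a continuation on it.
work = dist.all_reduce(tensor, async_op=True)
fut = work.get_future()  # torch.futures.Future
fut = fut.then(lambda f: print("allreduce finished on rank", dist.get_rank()))
fut.wait()  # optionally block on the chained Future as well

dist.destroy_process_group()
```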

Work is the one we use the most; Future is currently only used for the RPC feature.
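
For context, one place where the Python frontend consumes the Future derived from a backend's Work is DDP's communication hooks, which are expected to return a torch.futures.Future. Below is a minimal sketch modeled on the public register_comm_hook API (the hook body is illustrative, not the actual DDP internals):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def allreduce_hook(state, bucket: dist.GradBucket) -> torch.futures.Future:
    # Average the flattened gradient bucket across ranks asynchronously and
    # hand DDP back a Future it can chain its own bookkeeping onto.
    tensor = bucket.buffer().div_(dist.get_world_size())
    work = dist.all_reduce(tensor, async_op=True)
    # DDP expects the Future's value to be the reduced bucket tensor.
    return work.get_future().then(lambda f: f.value()[0])

# Usage (model assumed to be wrapped already):
#   ddp_model = DDP(model)
#   ddp_model.register_comm_hook(state=None, hook=allreduce_hook)
```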