Work vs. Future sync primitives for Distributed Torch backends

This is a question about the fundamental design of the Distributed Torch internals.
There are two ways to synchronize a distributed Torch execution: the Work handle that each PG call returns, or the Future handle that can be derived from the Work one.
Why is that? Why not rely only on the Work handle (via work->wait())?
I also ask because the provided examples based on the C++ frontend seem to prefer this approach (work->wait()), while the internals of DDP in the Python frontend are built upon the Future concept. This forces anyone willing to implement a custom backend to support both, and the Future implementation is quite intricate (also, why is it marked final?).
To me, the two serve the same purpose, which duplicates logic and creates confusion about when to use one over the other.
Is there a fundamental reason that led to requiring the Future approach over the Work one, which seems to be the one developed first (since all PG calls return a Work object)?
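
To make the comparison concrete, here is a toy model in plain Python of the two synchronization styles. It is not PyTorch code: `ToyWork`, `toy_all_reduce`, and the background thread are hypothetical stand-ins, and the future is the standard library's `concurrent.futures.Future`, used only to show how one handle can expose both a blocking wait and a derived future backed by the same completion state.

```python
import threading
from concurrent.futures import Future


class ToyWork:
    """Toy stand-in for a Work handle (hypothetical, not the PyTorch class)."""

    def __init__(self, fn, *args):
        self._future = Future()
        # Run the "collective" on a background thread, as an async backend might.
        self._thread = threading.Thread(target=self._run, args=(fn, *args))
        self._thread.start()

    def _run(self, fn, *args):
        try:
            self._future.set_result(fn(*args))
        except Exception as exc:
            self._future.set_exception(exc)

    def wait(self):
        """Block until the operation finishes; re-raise its error, if any."""
        self._future.result()
        return True

    def get_future(self):
        """Expose the same completion state as a future."""
        return self._future


def toy_all_reduce(values):
    # Placeholder for a real collective: just sum the contributions.
    return sum(values)


work = ToyWork(toy_all_reduce, [1, 2, 3])

# Style 1: block on the work handle.
work.wait()

# Style 2: derive a future from the same handle and attach a callback.
results = []
fut = work.get_future()
fut.add_done_callback(lambda f: results.append(f.result() * 2))
print(fut.result(), results)  # prints: 6 [12]
```

In this sketch the only thing the future adds over wait() is callback chaining. In PyTorch the analogous calls would be Work.get_future() and torch.futures.Future.then(); whether that difference is the actual design motivation is an assumption on my part, not something confirmed in this thread.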

Work is the one we use the most, and Future is only for the RPC feature currently.

I see.
I finally found a solution, which required extending the Future class.
Nonetheless, Work and Future are conceptually very similar, so why not just extend Work to support RPC instead of creating a separate Future class? Both have a .wait() method, for example.
I guess that would have been much easier to handle and support.
Lastly, does this mean that RPC is involved in DDP?