Yep, the first chart looks correct to me.
If so, is there any chance to enhance it, making it more like the chart below?
Do you need single-machine multi-GPU pipeline parallel or multi-machine pipeline parallel?
If it is within a single machine, it is possible to parallelize the backward pass as well. Check out the torchgpipe project: it inserts phony dependencies between stages of different micro-batches to enforce the pipeline execution order.
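To make the overlap concrete, here is a small, self-contained sketch (plain Python, no torch required) of the clock-cycle schedule a GPipe-style pipeline follows: with `num_stages` partitions and `num_microbatches` chunks, stage `j` works on micro-batch `k - j` at clock tick `k`, so different stages process different micro-batches concurrently. The function name and shape are illustrative, not part of any library API.

```python
def pipeline_schedule(num_microbatches, num_stages):
    """Return, per clock tick, the (stage, microbatch) pairs that can
    run concurrently in a GPipe-style pipeline schedule."""
    schedule = []
    # The pipeline drains after num_microbatches + num_stages - 1 ticks.
    for clock in range(num_microbatches + num_stages - 1):
        step = [(stage, clock - stage)
                for stage in range(num_stages)
                if 0 <= clock - stage < num_microbatches]
        schedule.append(step)
    return schedule

# 4 micro-batches over 3 stages: at tick 2 all three stages are busy.
for tick, work in enumerate(pipeline_schedule(4, 3)):
    print(tick, work)
```

The same staggering applies to the backward pass, which is what the phony dependencies in torchgpipe make the autograd engine respect.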
If it’s multi-machine pipeline parallel, then you will need RPC. As of today, distributed autograd cannot parallelize the backward pass, because smart mode has not been implemented yet. To work around this, you can still use RPC and RRef, but you cannot use distributed autograd and will need to manually stitch together local autograd graphs.
Another possibility (not 100% sure whether this would work) is to create one distributed autograd context per micro-batch and manually call `__exit__` on each context. As a result, the gradients for different micro-batches will be stored in different contexts, and hence you will need to call `dist_optimizer.step(ctx_id)` multiple times, once per context, to apply the gradients.
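A toy illustration of the bookkeeping this implies (plain Python, deliberately not real torch code; the class and function names only loosely mimic `torch.distributed.autograd` and `dist_optimizer.step(ctx_id)`): each micro-batch gets its own context holding its gradients, and the optimizer is stepped once per context id.

```python
class AutogradContext:
    """Stand-in for a distributed autograd context: holds the gradients
    accumulated for one micro-batch under a unique context id."""
    _next_id = 0

    def __init__(self):
        self.ctx_id = AutogradContext._next_id
        AutogradContext._next_id += 1
        self.grads = {}  # param name -> gradient value

def backward(ctx, grads):
    # Record this micro-batch's gradients in its own context,
    # instead of accumulating everything into one shared context.
    ctx.grads.update(grads)

def step(params, ctx, lr=0.1):
    # Apply only the gradients stored under this context id.
    for name, g in ctx.grads.items():
        params[name] -= lr * g

params = {"w": 1.0}
contexts = []
for micro_batch_grad in [0.5, 0.3]:  # pretend per-micro-batch gradients
    ctx = AutogradContext()
    backward(ctx, {"w": micro_batch_grad})
    contexts.append(ctx)

# One step call per context, mirroring dist_optimizer.step(ctx_id).
for ctx in contexts:
    step(params, ctx)
```

The point is only the structure: because gradients live in separate contexts rather than one accumulator, applying them requires iterating over the context ids.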
Things will become a lot easier when we add smart mode distributed autograd.