Hey @Vibhatha_Abeykoon, thanks for the question! This actually relates to several WIP projects that we are working on right now.
> When I need to do such a task, my training script must be written in such a way that if the original model was M, now I have M1 – M16 smaller models, each of which depends upon the output of the previous model in the sequence.
In this case, you need 4 instead of 16 smaller models, and within each model you can use `Tensor.to(device)` to move data across GPUs, as you mentioned below. For pipeline parallelism using RPC, this tutorial can serve as a reference (it will be released with v1.6).
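For concreteness, here is a minimal sketch of one such shard spanning two local GPUs. The `Shard` class, layer sizes, and device names are made up for illustration; the point is just the `Tensor.to(device)` hop between stages:

```python
import torch
import torch.nn as nn

class Shard(nn.Module):
    """One of the 4 per-machine models, itself split across two local GPUs."""
    def __init__(self, dev0, dev1):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        # first stage on one GPU, second stage on the other
        self.seq0 = nn.Linear(1024, 1024).to(dev0)
        self.seq1 = nn.Linear(1024, 1024).to(dev1)

    def forward(self, x):
        x = self.seq0(x.to(self.dev0))
        # Tensor.to(device) moves the intermediate activation across GPUs
        return self.seq1(x.to(self.dev1))

shard = Shard("cuda:0", "cuda:1")
out = shard(torch.randn(32, 1024))
```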
> I am not sure whether this is the best way to do this. If this is wrong, please explain the best practices with the RPC API.
This is not the most convenient way to support pipeline parallelism. RPC is a lower-level API that offers flexibility but would require additional application code to orchestrate. One of the projects we are looking into is providing a higher-level API, e.g., a `DistributedPipelineParallel` (DPP) (similar to `DistributedDataParallel`) which, ideally, can automatically divide the original model and place the model shards, maybe by using additional configuration hints or a specific model structure (e.g., `nn.Sequential`). But this is still under discussion, and there is no committed release date for it yet. Please do comment if you have suggestions or requirements for this feature.
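As a rough illustration of the kind of split a DPP-style API might automate, today you would partition an `nn.Sequential` by hand. The split point and layer sizes below are arbitrary:

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 10),
)

# manual split into two pipeline stages; a DPP-like API could derive
# this placement from the nn.Sequential structure instead
layers = list(model.children())
shard0 = nn.Sequential(*layers[:2]).to("cuda:0")
shard1 = nn.Sequential(*layers[2:]).to("cuda:1")
```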
> I need to somehow use an RPC call and send that data to machine 2. This is the same case for all boundary conditions.
If you want distributed autograd to automatically take care of the backward pass across machines, then yes, you will need to use RPC to send the intermediate output from machine 1 to machine 2. As of v1.6, RPC only accepts CPU tensors, so you will need to first move the tensor from `cuda:0` to `cpu` on machine 1, and then move the received tensor from `cpu` to `cuda:0` on machine 2. We explicitly added this restriction to avoid unintentional device mapping errors through RPC. We are also working on a new device placement API (similar to `map_location` in `torch.load`) to make this easier, where the application can define default device mappings between each pair of nodes and directly pass GPU tensors to RPC. We hope we can get this done in v1.7.
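A minimal sketch of that `cuda:0` → `cpu` → `cuda:0` round trip, assuming `rpc.init_rpc(...)` has already been called on both workers. The worker name `"worker2"`, the `forward_on_shard` function, and the layer are hypothetical:

```python
import torch
import torch.nn as nn
import torch.distributed.rpc as rpc

# hypothetical shard that lives on machine 2 ("worker2")
stage2 = nn.Linear(1024, 1024).to("cuda:0")

def forward_on_shard(x_cpu):
    # machine 2: move the received CPU tensor onto the local GPU ...
    y = stage2(x_cpu.to("cuda:0"))
    # ... and back to CPU before returning it through RPC
    return y.cpu()

# machine 1: move the intermediate output to CPU before the RPC call
x = torch.randn(32, 1024)
ret = rpc.rpc_sync("worker2", forward_on_shard, args=(x.cpu(),))
```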
> Data could be sent somehow via a synchronization mechanism.
What do you mean by “a synchronization mechanism” here?
> Is this something possible with the existing APIs? I am not quite clear how `DistributedOptimizer` and Distributed Autograd could handle this.
Yep, the tutorial linked above shows an example of using both together.
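For reference, the core pattern from that tutorial looks roughly like this. Here `model` is a placeholder for a pipeline whose shards live on remote workers and which exposes a `parameter_rrefs()` helper (as the tutorial's model does); the loss function and inputs are dummies:

```python
import torch
import torch.distributed.autograd as dist_autograd
from torch.distributed.optim import DistributedOptimizer

# parameter_rrefs() returns RRefs to all parameters, local and remote,
# so the DistributedOptimizer can step optimizers on every worker
opt = DistributedOptimizer(torch.optim.SGD, model.parameter_rrefs(), lr=0.05)

loss_fn = torch.nn.MSELoss()
inputs, targets = torch.randn(32, 1024), torch.randn(32, 10)

with dist_autograd.context() as context_id:
    out = model(inputs)              # forward pass crosses RPC boundaries
    loss = loss_fn(out, targets)
    # distributed autograd follows the recorded RPC edges backward
    dist_autograd.backward(context_id, [loss])
    opt.step(context_id)             # updates parameters on all workers
```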