I am wondering if it is possible to implement the following distributed training case.
From my understanding of the torch.distributed and RPC frameworks, it seems difficult, but I may have misunderstood.
Let’s say I have two workers (0 and 1), each with its own GPU. I have a module on worker 0, loaded on GPU 0.
Of course, I can run forward and backward passes on this module from worker 0. However, I would also like to “share” this module with worker 1, so that I can run forward and backward passes through the same module (the one loaded on GPU 0, with the same parameters) from worker 1, with inputs loaded on GPU 1.
As far as I understand it, the distributed RPC framework lets me create a RemoteModule on worker 0 that I can use from worker 1. However, that doesn’t cover my use case, since worker 0 can then not also use the module. I do not want to duplicate the module across GPUs, but to access a single copy from two different workers.
Thank you for your help.