I am training a model with multiple modules (e.g. transformer layers), and I want to switch which layer is trainable during training. For centralized (single-process) training this is straightforward: I only need to change the `requires_grad` attribute to control which layers receive gradients in the backward pass. However, when using DDP this seems to be nontrivial, since the gradients need to be synchronized, and that synchronization is handled by the reduce hooks registered when the DDP model is initialized.
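For reference, here is roughly what I do in the single-GPU case (the toy `nn.Sequential` model and the stage indices are just placeholders for my real modules):

```python
import torch
import torch.nn as nn

# toy model with several "stages" whose trainability I want to switch
model = nn.Sequential(
    nn.Linear(128, 128),  # stage 0
    nn.Linear(128, 128),  # stage 1
    nn.Linear(128, 10),   # stage 2
)

def set_trainable(model, trainable_idx):
    """Freeze everything, then unfreeze only the chosen stage."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model[trainable_idx].parameters():
        p.requires_grad = True

# e.g. switch the trainable stage partway through training
set_trainable(model, trainable_idx=1)
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
```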
I found the related GitHub issues #21344 and #21591. They suggest that directly destroying and reconstructing the DDP model may be a viable approach, but that seems inelegant and a bit time-consuming for my use case.
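To make that workaround concrete, this is roughly how I understand the suggestion from those issues: unwrap the module, flip `requires_grad`, and wrap it in a fresh DDP instance so the reducer only tracks the currently trainable parameters. This sketch assumes the same toy `nn.Sequential` layout as above and that the process group is already initialized; `rebuild_ddp` is just a name I made up.

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def rebuild_ddp(ddp_model, trainable_idx, device_ids):
    """Workaround from the linked issues: unwrap the model, change
    requires_grad, then wrap it in a new DDP instance so the reduce
    hooks are re-registered for the new trainable parameter set."""
    module = ddp_model.module  # unwrap the underlying model
    for p in module.parameters():
        p.requires_grad = False
    for p in module[trainable_idx].parameters():
        p.requires_grad = True
    return DDP(module, device_ids=device_ids)
```

The returned object would replace the old wrapper, and the optimizer would also need to be rebuilt over the new trainable parameters, which is exactly the part that feels heavy-handed if the switch happens frequently.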
Could you provide some guidance on how to switch the trainable parameters of a DDP model? I guess removing the old reduce hooks and registering new ones might be a clean and feasible solution, but I'm not sure how to do that exactly. cc @mrshenli
Any help would be appreciated!