I am training a model with multiple modules (e.g. transformer layers), and I want to switch which layer is trainable during training. For centralized (single-process) training this is straightforward: I only need to change the `requires_grad` attribute to control which layers receive gradients in the backward pass. However, when using DDP this seems to be nontrivial, since the gradients need to be synchronized, and that synchronization is handled by the reduce hooks registered when the DDP model is initialized.
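For reference, here is roughly what I do in the single-GPU case (the toy `nn.Sequential` model and the stage indices are just placeholders for my real modules):

```python
import torch
import torch.nn as nn

# toy model with several "stages" whose trainability I want to switch
model = nn.Sequential(
    nn.Linear(128, 128),  # stage 0
    nn.Linear(128, 128),  # stage 1
    nn.Linear(128, 10),   # stage 2
)

def set_trainable(model, trainable_idx):
    """Freeze everything, then unfreeze only the chosen stage."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model[trainable_idx].parameters():
        p.requires_grad = True

# e.g. switch the trainable stage partway through training
set_trainable(model, trainable_idx=1)
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
```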
I found the related GitHub issues #21344 and #21591. They suggest that directly destroying and reconstructing the DDP model may be a viable approach, but that seems inelegant and a bit time-consuming for my use case.
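To make that workaround concrete, this is roughly how I understand the suggestion from those issues: unwrap the module, flip `requires_grad`, and wrap it in a fresh DDP instance so the reducer only tracks the currently trainable parameters. This sketch assumes the same toy `nn.Sequential` layout as above and that the process group is already initialized; `rebuild_ddp` is just a name I made up.

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def rebuild_ddp(ddp_model, trainable_idx, device_ids):
    """Workaround from the linked issues: unwrap the model, change
    requires_grad, then wrap it in a new DDP instance so the reduce
    hooks are re-registered for the new trainable parameter set."""
    module = ddp_model.module  # unwrap the underlying model
    for p in module.parameters():
        p.requires_grad = False
    for p in module[trainable_idx].parameters():
        p.requires_grad = True
    return DDP(module, device_ids=device_ids)
```

The returned object would replace the old wrapper, and the optimizer would also need to be rebuilt over the new trainable parameters, which is exactly the part that feels heavy-handed if the switch happens frequently.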
Could you provide some guidance on how to switch the trainable parameters of a DDP model? I guess removing the old reduce hooks and registering new ones might be a clean and feasible solution, but I'm not sure how to do that exactly. cc @mrshenli
Any help would be appreciated!