Hi all,
I have a conceptual question regarding how to best use DistributedDataParallel (DDP) in a setup involving two models.
Suppose I have two nn.Module instances: model1 and model2. I want to train all parameters in model2, but only a subset of the parameters in model1; the remaining parameters in model1 should stay frozen. Note that both models are used sequentially to compute the loss.
Naturally, I would wrap model2 with DDP as follows:
model2 = DDP(model2, ...)
However, I am unsure how to ensure proper gradient synchronization across GPUs for the trainable parameters in model1. Clearly, if I don’t include model1 in DDP, its gradients won’t be all-reduced across ranks.
I see a couple of options:

- Wrap an nn.ModuleList containing only the trainable submodules from both model1 and model2, and pass that into DDP.
- Wrap both model1 and model2 with DDP independently, making sure to set requires_grad = False for the frozen parameters in model1, as suggested in this post.
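Concretely, I imagine option 2 would look something like this. This is just a minimal single-process sketch to illustrate what I mean: the tiny model1/model2 definitions are dummy stand-ins, and the gloo/world_size=1 setup is only there so the snippet runs on CPU (in practice it would be the usual torchrun multi-process launch):

```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process gloo group so the sketch runs on one CPU;
# a real run would use the normal torchrun/multi-GPU setup.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

# Dummy stand-ins for the real models.
model1 = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 4))
model2 = nn.Linear(4, 2)

# Freeze part of model1 BEFORE wrapping: DDP records which parameters
# require grad at construction time and only builds gradient buckets
# for those, so flipping requires_grad afterwards is not supported.
for p in model1[0].parameters():
    p.requires_grad = False

model1 = DDP(model1)
model2 = DDP(model2)

# Optimizer over the trainable parameters of both models only.
optimizer = torch.optim.SGD(
    [p for m in (model1, model2) for p in m.parameters() if p.requires_grad],
    lr=0.1,
)

# Both models are applied sequentially to compute the loss.
x = torch.randn(3, 8)
loss = model2(model1(x)).sum()
loss.backward()  # DDP all-reduces grads of trainable params in both wrappers
optimizer.step()

# The frozen layer gets no gradients; the trainable layer does.
frozen_grads = [p.grad for p in model1.module[0].parameters()]
trainable_grads = [p.grad for p in model1.module[1].parameters()]
dist.destroy_process_group()
```

My understanding is that since the frozen parameters never require grad when DDP is constructed, they are simply excluded from the reducer's buckets, so they add no communication overhead, but please correct me if that's wrong.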
Are either of these approaches recommended or is there a better practice for this use case? I’d like to make sure I’m not missing something regarding how DDP handles parameter synchronization and gradient communication when mixing frozen and trainable parameters across multiple modules.
Thanks in advance!