Hi all,
I have a conceptual question regarding how to best use DistributedDataParallel (DDP) in a setup involving two models.
Suppose I have two `nn.Module` instances: `model1` and `model2`. I want to train all parameters in `model2`, but only a subset of the parameters in `model1`; the remaining parameters in `model1` should stay frozen. Note that both models are used sequentially to compute the loss.
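For context, one training step looks roughly like this (`criterion`, `inputs`, and `targets` are placeholders for my actual pipeline):

```python
# Rough shape of one training step: model1 feeds into model2,
# and the loss depends on the outputs of both in sequence.
features = model1(inputs)        # only a subset of model1's parameters should train
predictions = model2(features)   # all of model2's parameters should train
loss = criterion(predictions, targets)
loss.backward()
```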
Naturally, I would wrap `model2` with DDP as follows:

```python
model2 = DDP(model2, ...)
```
However, I am unsure how to ensure proper gradient synchronization across GPUs for the trainable parameters in `model1`. Clearly, if I don’t include `model1` in DDP, its parameters won’t be synchronized.
I see a couple of options:

- Wrap an `nn.ModuleList` containing only the trainable submodules from both `model1` and `model2`, and pass that into DDP (rough sketch below).
- Wrap both `model1` and `model2` with DDP independently, making sure to set `requires_grad = False` for the frozen parameters in `model1`, as suggested in this post (rough sketch below).
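To make those concrete, here is roughly what I have in mind. In both sketches, `model1.trainable_part` is a placeholder for the subset of `model1` I actually want to train, and `device` / `local_rank` stand for the usual per-process DDP setup.

Option 1:

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Collect only the trainable pieces of both models into one container
# and wrap that container in DDP. The forward pass would still call
# model1 and model2 directly, so I am not sure this triggers DDP's
# gradient reduction correctly -- that is part of my question.
trainable = nn.ModuleList([model1.trainable_part, model2]).to(device)
trainable_ddp = DDP(trainable, device_ids=[local_rank])
```

Option 2:

```python
from torch.nn.parallel import DistributedDataParallel as DDP

# Freeze everything in model1 except the subset I want to train.
# (As I understand it, requires_grad must be set before constructing DDP.)
for p in model1.parameters():
    p.requires_grad = False
for p in model1.trainable_part.parameters():
    p.requires_grad = True

model1 = DDP(model1.to(device), device_ids=[local_rank])
model2 = DDP(model2.to(device), device_ids=[local_rank])
```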
Is either of these approaches recommended, or is there a better practice for this use case? I’d like to make sure I’m not missing something about how DDP handles parameter synchronization and gradient communication when mixing frozen and trainable parameters across multiple modules.
Thanks in advance!