How to handle training only a few layers with DDP

Hi all,

I have a conceptual question regarding how to best use DistributedDataParallel (DDP) in a setup involving two models.

Suppose I have two nn.Module instances, model1 and model2. I want to train all parameters in model2, but only a subset of the parameters in model1; the remaining parameters in model1 should stay frozen. Note that both models are used sequentially to compute the loss.
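
For concreteness, here is a rough self-contained sketch of the kind of setup I mean (the module names, shapes, and loss below are made up, not my actual code):

import torch
import torch.nn as nn

# Made-up stand-ins for the two models.
class Model1(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(16, 16)  # should stay frozen
        self.adapter = nn.Linear(16, 16)   # should be trained

    def forward(self, x):
        return self.adapter(self.backbone(x))

model1 = Model1()
model2 = nn.Linear(16, 4)  # fully trainable
criterion = nn.CrossEntropyLoss()

x = torch.randn(8, 16)
target = torch.randint(0, 4, (8,))

# Both models are used sequentially to compute the loss.
loss = criterion(model2(model1(x)), target)
loss.backward()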

Naturally, I would wrap model2 with DDP as follows:

model2 = DDP(model2, ...)

However, I am unsure how to ensure proper gradient synchronization across GPUs for the trainable parameters in model1. Clearly, if I don’t include model1 in DDP, its gradients won’t be synchronized across processes.

I see a couple of options:

  1. Wrap an nn.ModuleList containing only the trainable submodules from both model1 and model2, and pass that into DDP.

  2. Wrap both model1 and model2 with DDP independently, making sure to set requires_grad = False for the frozen parameters in model1, as suggested in this post.

Is either of these approaches recommended, or is there a better practice for this use case? I’d like to make sure I’m not missing something about how DDP handles parameter synchronization and gradient communication when mixing frozen and trainable parameters across multiple modules.

Thanks in advance!

I would stick to the linked post explaining that frozen parameters are skipped in DDP’s communication. You would thus have to set the .requires_grad attribute of all frozen parameters to False before creating the DDP instances.
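
A minimal sketch of that order of operations, using toy modules and a single-process gloo group only so it runs standalone (all names and shapes are made up; in a real run you would launch one process per GPU, e.g. via torchrun):

import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process CPU setup just to make the sketch runnable as-is.
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

# Toy stand-ins: the first Linear in model1 plays the role of the frozen part.
model1 = nn.Sequential(nn.Linear(16, 16), nn.Linear(16, 16))
model2 = nn.Linear(16, 4)

# 1) Freeze *before* creating the DDP instances, so the reducer skips these
#    parameters when it registers its gradient hooks and buckets.
for p in model1[0].parameters():
    p.requires_grad = False

# 2) Wrap both models independently; only parameters with requires_grad=True
#    take part in the gradient all-reduce.
model1 = DDP(model1)
model2 = DDP(model2)

# One optimizer over the trainable parameters of both models.
params = [p for p in list(model1.parameters()) + list(model2.parameters()) if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.1)
criterion = nn.CrossEntropyLoss()

x = torch.randn(8, 16)
target = torch.randint(0, 4, (8,))

loss = criterion(model2(model1(x)), target)
loss.backward()   # grads of the trainable Linear in model1 and of model2 are all-reduced
optimizer.step()

dist.destroy_process_group()

On GPUs the same pattern applies: move each model to its device before wrapping and pass device_ids=[rank] to DDP.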

Thank you! I’ll mark your answer as the solution 🙂