How does DistributedDataParallel handle parameters whose requires_grad flag is False?

Hello, I’m trying to figure out what happens behind the scenes in DistributedDataParallel when it comes to parameters that do not require a gradient. I cannot find a clear answer on this in the documentation.

Assume we have three layers: A --> B --> C. Suppose that A and B both have requires_grad set to False. If this model is wrapped in DistributedDataParallel, will there be any inter-process communication during the backward pass for layer A or C, specifically for sharing gradients?
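For concreteness, I mean something like this (toy layer sizes, just to illustrate which parts are frozen):

```python
import torch.nn as nn

# Toy stand-in for the A --> B --> C model; the sizes are made up.
model = nn.Sequential(
    nn.Linear(128, 128),  # A (frozen)
    nn.Linear(128, 128),  # B (frozen)
    nn.Linear(128, 10),   # C (trainable)
)

# Freeze A and B.
for p in list(model[0].parameters()) + list(model[1].parameters()):
    p.requires_grad = False
```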

My problem is that I have a large model and I am heavily bottlenecked by process communication. I would like to freeze some of my layers so that fewer gradients need to be shared. I understand that, depending on my model, the computation cost may stay the same, but I really need to bring down the communication cost.

Hey @ayalaa2, DistributedDataParallel’s (DDP) constructor goes through all parameters and skips the ones with requires_grad=False. So there won’t be any communication for those gradients, but you have to set their requires_grad field before passing the model to DDP. After the constructor runs, changing the requires_grad attribute makes no difference; this filtering happens in the DDP constructor.
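A minimal sketch of that ordering (assuming a torchrun launch, which sets LOCAL_RANK; layer sizes are placeholders):

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes launch with torchrun, which provides the env:// init info.
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = nn.Sequential(
    nn.Linear(128, 128),  # "A"
    nn.Linear(128, 128),  # "B"
    nn.Linear(128, 10),   # "C"
).cuda(local_rank)

# Freeze A and B *before* constructing DDP; the constructor only registers
# parameters that require grad, so only C's gradients get allreduced.
for p in list(model[0].parameters()) + list(model[1].parameters()):
    p.requires_grad = False

ddp_model = DDP(model, device_ids=[local_rank])
```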

Another way to skip communication is to use the no_sync context manager.
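A rough sketch of what that looks like, assuming the `ddp_model` from above and an otherwise standard training loop (loss, optimizer, and the random batch are just placeholders):

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(
    (p for p in ddp_model.parameters() if p.requires_grad), lr=0.01
)

inputs = torch.randn(32, 128).cuda()
targets = torch.randint(0, 10, (32,)).cuda()

# Inside no_sync(), backward() only accumulates gradients locally; no
# allreduce happens.
with ddp_model.no_sync():
    loss = criterion(ddp_model(inputs), targets)
    loss.backward()

# The first backward outside the context triggers the allreduce of the
# accumulated gradients.
loss = criterion(ddp_model(inputs), targets)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```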


Hi @mrshenli, you mentioned that DDP can skip the gradient communication for parameters whose requires_grad=False, but that the flag must be set before wrapping the model with DDP.

I have two follow-up questions:

  1. If I set requires_grad=False during training, after the DDP ctor, will these parameters still be updated, given that their gradients are still communicated?

  2. Is there a way to dynamically freeze more parameters and skip their communication during DDP training after constructing the DDP model?

My use case is gradually freezing more layers during training/fine-tuning. Thanks!

Hey @wydwww

DDP only builds its communication buckets once (actually twice: first using the reverse order of model.parameters(), then using the autograd order observed during the first iteration). So DDP will still communicate all parameters’ gradients, even if some of them are marked requires_grad=False after the DDP ctor.

  2. Is there a way to dynamically freeze more parameters and skip their communication during DDP training after constructing the DDP model?

If this is not very frequent (say, once per epoch), you can destroy the DDP instance and create a new one.
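A rough sketch of that pattern; names like `layers_to_freeze` and `local_rank` are placeholders, and rebuilding the optimizer is one common way to make sure the frozen parameters are no longer updated:

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# Rebuild DDP after freezing more layers, e.g. at an epoch boundary.
# Assumes `ddp_model` is the current wrapper and `layers_to_freeze` is a
# list of submodules you want to stop training.
model = ddp_model.module          # unwrap the underlying module
del ddp_model                     # drop the old wrapper and its buckets

for layer in layers_to_freeze:
    for p in layer.parameters():
        p.requires_grad = False

ddp_model = DDP(model, device_ids=[local_rank])

# Recreating the optimizer over the still-trainable parameters prevents the
# frozen ones from being updated (e.g. by momentum or weight decay).
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=0.01
)
```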


Thanks. I will try this solution.

I found this suggestion quite useful. I am facing exactly the same problem described here, but I can’t fix it by just destroying the DDP instance and creating a new one 🙁. Basically I have a model wrapped in DDP, and after 10 epochs I want to freeze some parts using requires_grad = False and keep training the rest of the model.

To destroy the DDP model instance I am following Safely removing a Module from DDP, that is, model = model.module, but it is still not working.

Any suggestions?

Could you go into more detail about how exactly it is failing?
Do you have logs of the failure to help troubleshoot it?

Did you first extract the module from DDP (model = model.module) and then create a new DDP instance with the extracted model? And what exactly do you mean by “not working”?