Freezing layers issue for parallel GPU

Hi,

So I think I found the cause of the weird error (see below) that I get when trying to freeze layers with the following method (a context manager also fails to freeze layers): it's using DataParallel (model = nn.DataParallel(model)) across multiple GPUs. I've been running my model on 2 identical GPUs (GTX 1080), and when I tried to freeze weights I got the error shown below. When I don't apply DataParallel and just use a single GPU, freezing layers works like a charm, as shown in countless examples across the internet and on this forum, but when DataParallel is used it throws a weird error (it doesn't make sense to me that the layers it's trying to freeze would not be leaf variables).

Could anyone help me understand what is happening and whether this is a bug in PyTorch? (Regardless of whether it is or not, I'd like to know if there is a workaround.)

for param in model.parameters():
  param.requires_grad = False

RuntimeError: you can only change requires_grad flags of leaf variables. If you want to use a computed variable in a subgraph that doesn’t require differentiation use var_no_grad = var.detach().

Thanks,

Could you post a code snippet to reproduce this issue?
This dummy code snippet works:

import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18()

# Freeze all parameters before wrapping the model
for param in model.parameters():
    param.requires_grad = False

model = nn.DataParallel(model)
model.cuda()

out = model(torch.randn(8, 3, 224, 224))
print(out.shape)

Here it is. If you comment out the model = nn.DataParallel(model) line, no error occurs, otherwise the error occurs (on a computer with multiple GPUs):

from __future__ import print_function
from __future__ import division
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

class mynet(torch.nn.Module):
    def __init__(self):
        super(mynet, self).__init__()
        self.fc = nn.Linear(2, 1, bias=False)

    def forward(self, x):
        # freezing is attempted inside forward here
        for param in self.fc.parameters():
            param.requires_grad = False

        return self.fc(x)

model = mynet()
model = nn.DataParallel(model)  # comment out this line to make the error go away
model.cuda()

data = np.array([0.0, 1.0])
x = torch.from_numpy(data).float().unsqueeze(0).unsqueeze(0).unsqueeze(0)
gt = torch.from_numpy(np.array([0.0])).float().unsqueeze(0).unsqueeze(0).unsqueeze(0)

# model.train()
result = model(x=x.cuda())


Thanks for the code.
The error is most likely thrown because you are trying to manipulate the requires_grad attribute of the parameters of the model replicas.
If you freeze the parameters outside of the forward method, it’ll work.
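For example, something like this minimal sketch (based on your snippet, with made-up shapes) should run without the error, since the requires_grad flags are changed on the original leaf parameters rather than on the replicas:

import torch
import torch.nn as nn

class MyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(2, 1, bias=False)

    def forward(self, x):
        # no requires_grad changes in here
        return self.fc(x)

model = MyNet()

# Freeze on the original module, whose parameters are leaf tensors
for param in model.fc.parameters():
    param.requires_grad = False

model = nn.DataParallel(model)
model.cuda()

out = model(torch.randn(4, 2).cuda())  # no RuntimeError; fc stays frozen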

What's your workflow that makes you want to freeze them in each forward pass?

Hm… I see. Ideally, I wanted to train some iterations with part of my network frozen, then some iterations with it unfrozen, and keep alternating.

I have 4 sub-networks (a, b, c, d) and a big wrapper network (say N) that contains those 4 sub-networks, where the wrapper's flow goes (a and b separately) -> combined into c -> then d. I want to freeze subnetwork a's weights (load pickled trained weights for subnetwork a and not train them).

  1. Not considering alternate freezing/unfreezing, what's the best way to freeze the weights in this case?
     Would it be like the following? But once DataParallel wraps the model, the "a" subnetwork is no longer accessible from model. What's the best way of freezing subnetwork a's weights in a multi-GPU setting? Maybe something via model.modules(), or model._modules['module'].a.parameters()?
model = N()
model = nn.DataParallel(model)
for parameter in model.a.parameters():  # fails: "a" is no longer a direct attribute after wrapping
    parameter.requires_grad = False
  2. If I wanted to alternate freezing/unfreezing, what's the ideal way?

You should still be able to freeze the submodule via model.module.a after you've wrapped the model in nn.DataParallel. As long as your changes are done outside of the forward pass (which is executed on each replica), it should be fine.
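For example, something along these lines (a rough sketch; N, the submodule names, and the set_requires_grad helper are placeholders based on your description, not a tested implementation):

import torch.nn as nn

def set_requires_grad(module, flag):
    # Toggle requires_grad on every parameter of a submodule
    for param in module.parameters():
        param.requires_grad = flag

model = N()                     # your wrapper containing submodules a, b, c, d
model = nn.DataParallel(model)
model.cuda()

# 1. Freeze subnetwork a: reach the wrapped model through .module
set_requires_grad(model.module.a, False)

# ... train some iterations with a frozen ...

# 2. To alternate, flip the flag between training phases (always outside forward)
set_requires_grad(model.module.a, True)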