Which one makes training faster (or more efficient): detach() or requires_grad = False?

Hello,

When training only specific layers in a model, which one should make training faster: detach() or requires_grad = False? Or is there no difference?

Assume you have a pretrained model, want to fine-tune some of its layers while freezing the others, and your optimizer contains only the updatable parameters (i.e., parameters with requires_grad = False are not passed to the optimizer).
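
For reference, a minimal sketch of that setup (layer names follow torchvision's ResNet; the learning rate is just a placeholder):

    import torch
    import torchvision

    model = torchvision.models.resnet18()  # swap in your own pretrained model here

    # freeze conv1, bn1, layer1, and layer2
    for module in [model.conv1, model.bn1, model.layer1, model.layer2]:
        for param in module.parameters():
            param.requires_grad = False

    # pass only the updatable parameters to the optimizer
    optimizer = torch.optim.SGD(
        (p for p in model.parameters() if p.requires_grad), lr=1e-3
    )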

Approach 1: ResNet with frozen conv1, bn1, layer1, and layer2

    def forward(self, x):
        x = self.conv1(x) # its parameters have requires_grad = False
        x = self.bn1(x) # its parameters have requires_grad = False
        x = self.relu(x)
        x = self.maxpool(x)

        x = self.layer1(x) # its parameters have requires_grad = False
        x = self.layer2(x) # its parameters have requires_grad = False
        x = self.layer3(x)
        x = self.layer4(x)

        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.fc(x)

        return x

vs.

Approach 2: ResNet with detach() after layer2

    def forward(self, x):
        x = self.conv1(x) 
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)

        x = self.layer1(x)
        x = self.layer2(x)

        x = x.detach()

        x = self.layer3(x)
        x = self.layer4(x)

        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.fc(x)

        return x

Also, would the answer change if the optimizer contained all of the model's parameters (none set to requires_grad = False) but detach() were still applied after layer2?

Thank you! (and Happy Holidays!)

I think both approaches would yield the same execution, as Autograd should be smart enough to stop backpropagation once no gradients are needed in the preceding layers (approach 1).
The difference between passing all parameters and passing only the subset that requires gradients would show up in this check, which verifies whether a valid gradient is set for the current parameter and skips it if not. I also doubt you would see a significant performance difference, so which approach you use comes down to your coding style and how explicit you want your code to be.
Personally, I would avoid passing unused parameters to the optimizer just to avoid any side effects.
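
For illustration, a minimal sketch (using torchvision's resnet18 as a stand-in) showing that frozen parameters simply never receive a .grad, which is exactly what that check looks for:

    import torch
    import torchvision

    model = torchvision.models.resnet18()
    for param in model.layer2.parameters():  # freeze e.g. layer2 as in approach 1
        param.requires_grad = False

    out = model(torch.randn(1, 3, 224, 224))
    out.sum().backward()

    print(model.layer2[0].conv1.weight.grad)  # None -> the optimizer would skip it
    print(model.fc.weight.grad is None)       # False -> trainable layers did get gradients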

Thank you @ptrblck for the answer! Good to hear that there would not be a significant difference in terms of performance.

I also agree with you and would avoid passing unused parameters to the optimizer.