Hello,
When training only specific layers of a model, which approach should make training faster: detach() or requires_grad = False? Or is there no difference?
Assume you have a pretrained model and want to fine-tune some of its layers while freezing the others, and that your optimizer contains only the updatable parameters (i.e., parameters with requires_grad = False are not passed to the optimizer).
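For concreteness, this is roughly the setup I mean (just a sketch; torchvision's resnet18 and SGD are stand-ins for whatever model and optimizer you use):

import torch
import torchvision

model = torchvision.models.resnet18(pretrained=True)

# Freeze conv1, bn1, layer1, and layer2
for module in (model.conv1, model.bn1, model.layer1, model.layer2):
    for param in module.parameters():
        param.requires_grad = False

# Optimizer receives only the parameters that are still updatable
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)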
Approach 1: ResNet with frozen conv1, bn1, layer1, and layer2
def forward(self, x):
    x = self.conv1(x)   # its parameters have requires_grad = False
    x = self.bn1(x)     # its parameters have requires_grad = False
    x = self.relu(x)
    x = self.maxpool(x)
    x = self.layer1(x)  # its parameters have requires_grad = False
    x = self.layer2(x)  # its parameters have requires_grad = False
    x = self.layer3(x)
    x = self.layer4(x)
    x = self.avgpool(x)
    x = torch.flatten(x, 1)
    x = self.fc(x)
    return x
vs.
Approach 2: ResNet with detach() after layer2
def forward(self, x):
    x = self.conv1(x)
    x = self.bn1(x)
    x = self.relu(x)
    x = self.maxpool(x)
    x = self.layer1(x)
    x = self.layer2(x)
    x = x.detach()      # cut the graph here: no gradients flow back to conv1..layer2
    x = self.layer3(x)
    x = self.layer4(x)
    x = self.avgpool(x)
    x = torch.flatten(x, 1)
    x = self.fc(x)
    return x
Also, would the answer change if the optimizer contained all of the model's parameters (none of them with requires_grad = False), but with detach() still applied after layer2?
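That is, keeping the detach() in forward() but building the optimizer over everything, roughly:

# Variant: every parameter goes to the optimizer; only the detach()
# in forward() limits which parameters actually receive gradients.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)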
Thank you! (and Happy Holidays!)