I want to add a new layer on top of a pretrained model. The pretrained model will not be updated; only the added layer will be trained. So my question is: can I wrap the forward pass of the pretrained part in torch.no_grad()? Is this reasonable? Will it reduce memory usage and speed up training?
Yes, this should work as shown in this small code snippet:
import torch
import torch.nn as nn
import torch.nn.functional as F

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)
        self.fc1 = nn.Linear(4 * 4 * 16, 10)

    def forward(self, x):
        # pretrained part: run without building the autograd graph
        with torch.no_grad():
            x = F.relu(self.conv1(x))
        x = x.view(x.size(0), -1)
        # newly added layer: trained as usual
        x = F.relu(self.fc1(x))
        return x

model = MyModel()
x = torch.randn(1, 3, 4, 4)
out = model(x)
out.mean().backward()

# only fc1's parameters will show gradients
for name, param in model.named_parameters():
    if param.grad is not None:
        print(name, param.grad.abs().sum())
If you remove the torch.no_grad() guard, all layers will get gradients.
Alternatively, you could set the requires_grad attribute of the pretrained parameters to False.
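For example, a minimal sketch of the requires_grad=False approach (reusing MyModel and the imports from the snippet above, but assuming the torch.no_grad() guard is removed from forward):

model = MyModel()
# freeze the pretrained part by disabling gradients for its parameters
for param in model.conv1.parameters():
    param.requires_grad = False

# optionally, pass only the trainable parameters to the optimizer
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=1e-2)

out = model(torch.randn(1, 3, 4, 4))
out.mean().backward()
# conv1's grads stay None, while fc1's parameters receive gradients
optimizer.step()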
So which way will be faster, torch.no_grad() or requires_grad=False?
In my view, torch.no_grad() will not calculate the gradients of the inputs to the layers in the pretrained part, while requires_grad=False still does. So torch.no_grad() should be faster? Is that right?
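One way to check this empirically is the minimal standalone sketch below; the input is given requires_grad=True only so the difference becomes observable, and the layers are just placeholders for a real pretrained backbone:

import torch
import torch.nn as nn
import torch.nn.functional as F

conv = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)  # stand-in for the pretrained part
fc = nn.Linear(4 * 4 * 16, 10)                               # new, trainable layer

# 1) torch.no_grad(): no graph is built through the frozen part
x1 = torch.randn(1, 3, 4, 4, requires_grad=True)
with torch.no_grad():
    h = F.relu(conv(x1))
out = F.relu(fc(h.view(h.size(0), -1)))
out.mean().backward()
print(x1.grad)  # None: nothing was backpropagated through conv

# 2) requires_grad=False on conv's parameters only
for p in conv.parameters():
    p.requires_grad = False
x2 = torch.randn(1, 3, 4, 4, requires_grad=True)
h = F.relu(conv(x2))
out = F.relu(fc(h.view(h.size(0), -1)))
out.mean().backward()
print(x2.grad is not None)  # True: gradients still flow through conv to its input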