How to calculate VGG feature loss without saving unnecessary gradients

Hi, I use a pre-trained VGG network as a feature extractor and compute an L1Loss between the VGG features of two images. Until now, I implemented this by zeroing the gradients of VGG each time.
But today I saw an implementation that sets requires_grad = False on the VGG parameters. I am curious: if requires_grad is False, how do the gradients backpropagate to the network before VGG? Are the gradients of VGG zeroed after each backpropagation?

Here is the relevant part of the code.

import torch
import torch.nn as nn
from torchvision import models

class Vgg19(torch.nn.Module):
    def __init__(self, requires_grad=False):
        super(Vgg19, self).__init__()
        vgg_pretrained_features = models.vgg19(pretrained=True).features
        self.slice1 = torch.nn.Sequential()
        self.slice2 = torch.nn.Sequential()
        self.slice3 = torch.nn.Sequential()
        self.slice4 = torch.nn.Sequential()
        self.slice5 = torch.nn.Sequential()
        for x in range(2):
            self.slice1.add_module(str(x), vgg_pretrained_features[x])
        for x in range(2, 7):
            self.slice2.add_module(str(x), vgg_pretrained_features[x])
        for x in range(7, 12):
            self.slice3.add_module(str(x), vgg_pretrained_features[x])
        for x in range(12, 21):
            self.slice4.add_module(str(x), vgg_pretrained_features[x])
        for x in range(21, 30):
            self.slice5.add_module(str(x), vgg_pretrained_features[x])
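        # Freeze the VGG weights: autograd skips their gradients,
        # but gradients with respect to the input X still flow through.
        if not requires_grad:
            for param in self.parameters():
                param.requires_grad = False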

    def forward(self, X):
        h_relu1 = self.slice1(X)
        h_relu2 = self.slice2(h_relu1)
        h_relu3 = self.slice3(h_relu2)
        h_relu4 = self.slice4(h_relu3)
        h_relu5 = self.slice5(h_relu4)
        out = [h_relu1, h_relu2, h_relu3, h_relu4, h_relu5]
        return out

class VGGLoss(nn.Module):
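    def __init__(self, gpu_ids):
        super(VGGLoss, self).__init__()
        self.vgg = Vgg19().cuda()  # frozen feature extractor (requires_grad=False by default)
        self.criterion = nn.L1Loss()
        # Per-slice weights: deeper feature maps contribute more to the loss.
        self.weights = [1.0/32, 1.0/16, 1.0/8, 1.0/4, 1.0]

    def forward(self, x, y):
        x_vgg, y_vgg = self.vgg(x), self.vgg(y)
        loss = 0
        for i in range(len(x_vgg)):
            # detach() treats the features of y as constants, so gradients only flow back through x.
            loss += self.weights[i] * self.criterion(x_vgg[i], y_vgg[i].detach())
        return loss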

Do you need the gradients for your pretrained VGG model or are you using it as a fixed feature extractor?
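If the latter is true and nothing upstream of VGG needs gradients, you can use Variable(..., volatile=True) (replaced by torch.no_grad() since PyTorch 0.4), so that evaluating the model uses the minimum amount of memory. Note that this also stops gradients from reaching any network that feeds into VGG, so here it would only fit the target branch.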
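For reference, here is a minimal sketch of that idea on current PyTorch, where torch.no_grad() is used instead of volatile=True. The small conv stack is a hypothetical stand-in for VGG, and `pred` stands in for the output of whatever network comes before it; both are assumptions made for illustration. The extractor's parameters are left trainable on purpose, to show that no_grad alone stops the graph from being recorded.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the VGG feature extractor (parameters left trainable).
extractor = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU())

# `pred` stands in for the output of an upstream network; `target` is fixed data.
pred = torch.randn(1, 3, 16, 16, requires_grad=True)
target = torch.randn(1, 3, 16, 16)

# Target branch: inside no_grad, autograd records no graph for this forward pass.
with torch.no_grad():
    target_feat = extractor(target)

# Prediction branch: must run outside no_grad so gradients can reach `pred`.
pred_feat = extractor(pred)

loss = nn.functional.l1_loss(pred_feat, target_feat)
loss.backward()

print(target_feat.grad_fn)  # no graph was recorded for the target branch
print(pred.grad.shape)      # the prediction branch still received a gradient
```

Wrapping the prediction branch in no_grad as well would make loss.backward() fail, since nothing in the loss would be connected to `pred` any more.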

Parameters = network weights (W). Setting requires_grad = False on them tells autograd not to compute d/dW for VGG.
The gradients with respect to the inputs (d/dX) are still computed, and only these are required for the chain rule to reach the network before VGG. d/dW depends on all the d/dX's ahead of it, but not vice versa.
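A small sketch of this behaviour, assuming two tiny conv layers as hypothetical stand-ins: `generator` for the network being trained and `extractor` for the frozen VGG. It is not the original training code, only an illustration that freezing the weights does not block gradients to the upstream network.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins: `generator` is trained, `extractor` plays the frozen VGG.
generator = nn.Conv2d(3, 3, kernel_size=3, padding=1)
extractor = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU())
for p in extractor.parameters():
    p.requires_grad = False  # skip d/dW for the extractor

x = torch.randn(2, 3, 16, 16)
target = torch.randn(2, 3, 16, 16)

fake = generator(x)
loss = nn.functional.l1_loss(extractor(fake), extractor(target).detach())
loss.backward()

# Frozen weights never receive a gradient, so there is nothing to zero out.
extractor_grads = [p.grad for p in extractor.parameters()]
# The upstream network still gets gradients, via d/dX through the extractor.
generator_grads = [p.grad for p in generator.parameters()]

print(all(g is None for g in extractor_grads))
print(all(g is not None for g in generator_grads))
```

Because the frozen parameters keep grad set to None, there is no need to call zero_grad() on the extractor between iterations.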


Hi @Ginsunuva,

Thank you for your reply, I would like to learn more about it.

For example, assuming the pre-trained model is a weighted MSE loss,

where x is the input, y is the target, alpha is a parameter of the pre-trained model, and f is the model I want to train. According to your answer, we still have to compute dL/df, but we don't have to compute dL/d(alpha), so we can save GPU memory this way.

Do I understand correctly?
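This scenario can be checked with a short sketch. The loss formula is not shown in the post, so the form L = alpha * mean((f(x) - y)^2) used here is an assumption, as are the linear layer standing in for f and the value of alpha.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

f = nn.Linear(4, 1)        # hypothetical model being trained
alpha = torch.tensor(0.5)  # frozen "pre-trained" parameter (requires_grad is False)

x = torch.randn(8, 4)
y = torch.randn(8, 1)

# Assumed loss form: L = alpha * mean((f(x) - y)^2)
loss = alpha * ((f(x) - y) ** 2).mean()
loss.backward()

print(alpha.grad)           # no gradient is computed or stored for alpha
print(f.weight.grad.shape)  # gradients still reach the parameters of f
```

Under these assumptions, dL/d(alpha) is never computed, while the gradient of the loss with respect to f(x) is, since the parameters of f depend on it.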