Is it necessary to use with torch.no_grad() for feature extraction?

I’m attempting feature extraction in an unorthodox way. I extract features in eval() mode so that dropout is switched off and the batch norm layers use the running mean and std from ImageNet pretraining instead of per-batch statistics.

I use a feature extractor to extract features from two related images and concatenate the two tensors stackwise before passing through a linear dense classifier model for training. I’m wondering whether I can avoid using with torch.no_grad() as the two models are unrelated.

Here is a simplified version:

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

num_classes = 2
num_epochs = 10

densenet = DenseNetConv()  # custom DenseNet feature extractor, defined elsewhere
densenet.eval()  # eval mode: batch norm uses running statistics, dropout is disabled
densenet.to(device)

classifier = nn.Linear(4416, num_classes)
classifier.to(device)

criterion = nn.CrossEntropyLoss().to(device)
# Create the optimizer after the classifier exists; it holds only classifier parameters
optimizer = torch.optim.Adam(classifier.parameters(), lr=0.001)

for epoch in range(num_epochs):

    classifier.train()

    for i, (inputs_1, inputs_2, labels) in enumerate(dataloaders_dict['train']):
        inputs_1 = inputs_1.to(device)
        inputs_2 = inputs_2.to(device)
        labels = labels.to(device)

        features_1 = densenet(inputs_1)  # extract features 1
        features_2 = densenet(inputs_2)  # extract features 2

        combined = torch.cat([features_1, features_2], dim=1)  # combine features
        combined = combined.view(-1, 4416)  # reshape to (batch, 4416)

        optimizer.zero_grad()

        # Forward pass to get output/logits
        outputs = classifier(combined)

        # Calculate loss: CrossEntropyLoss applies log-softmax internally
        loss = criterion(outputs, labels)

        _, pred = torch.max(outputs, 1)
        equality_check = (labels == pred)  # per-sample correctness, for accuracy

        # Getting gradients w.r.t. parameters
        loss.backward()
        optimizer.step()


As you can see, I do not call with torch.no_grad(), despite having densenet.eval() as my separate feature extractor. Is there an issue with the way this is implemented or can I assume that this will not interfere with the classifier model?

You don’t need to compute gradients during feature extraction, so wrapping the two forward passes in torch.no_grad() is fine:

with torch.no_grad():
    features_1 = densenet(inputs_1) # extract features 1
    features_2 = densenet(inputs_2) # extract features 2


Would this not interfere with the classifier model I am training? Or can I get away with not applying with torch.no_grad()?

My only concern is whether gradients can leak to the other model in the same batch loop.

I’m not sure what you mean by “interfere”, but if the goal is to train only the classifier, there are at least two ways to do that:

  1. Freeze the feature extraction model
  2. Pass only the classifier parameters to the optimizer and compute the features under torch.no_grad()

These are the best ways to do what you want without wasting memory.
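
For reference, the two options above could look roughly like this. This is only a sketch: `backbone` is a small hypothetical stand-in for the pretrained feature extractor (not the DenseNetConv from the question), and the tensor shapes are made up for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the pretrained feature extractor.
backbone = nn.Sequential(nn.Linear(8, 4), nn.Tanh())
backbone.eval()
classifier = nn.Linear(4, 2)

# Option 1: freeze the feature extractor so autograd skips its parameters.
for p in backbone.parameters():
    p.requires_grad_(False)
frozen_features = backbone(torch.randn(3, 8))

# Option 2: give the optimizer only the classifier parameters and
# compute the features under torch.no_grad().
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
with torch.no_grad():
    no_grad_features = backbone(torch.randn(3, 8))

# One training step on the classifier alone.
labels = torch.tensor([0, 1, 0])
loss = nn.functional.cross_entropy(classifier(no_grad_features), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In both cases the feature tensors carry no autograd history, so no intermediate activations of the extractor need to be kept for the backward pass.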


Ah OK, so the only issue is wasted memory. From what you are saying, the feature extractor will have no impact on the classifier model parameters even if I don’t use torch.no_grad()?

(Edited reply)
.eval() does not prevent the autograd graph from being built.
The entire model would be updated, and this is not what you want.

The methods I suggested will help you :)

For more details, see:
Module — PyTorch 1.12 documentation
Autograd mechanics — PyTorch 1.12 documentation
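
The point that .eval() and torch.no_grad() are independent can be checked with a small experiment. This is a minimal sketch; the tiny `model` below is a hypothetical example, not the network from the question.

```python
import torch
import torch.nn as nn

# Hypothetical small model containing the layers that eval() affects.
model = nn.Sequential(nn.Linear(8, 4), nn.BatchNorm1d(4), nn.Dropout(0.5))
model.eval()  # changes batch norm / dropout behaviour only

x = torch.randn(3, 8)

out_eval = model(x)              # eval mode, autograd still recording
with torch.no_grad():
    out_no_grad = model(x)       # eval mode, autograd disabled

print(out_eval.requires_grad)     # True
print(out_no_grad.requires_grad)  # False
same_values = torch.allclose(out_eval, out_no_grad)
```

The outputs have the same values either way; only the autograd bookkeeping differs.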

I’m confused. I know it won’t prevent the densenet graph from being created, but does it impact the other model whose parameters are set to be updated (the classifier), given that there are essentially two models in the loop?

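Oh, I missed that the optimizer contains only the classifier parameters.

Your code will work as you intended. Gradients are still computed for densenet, but the optimizer never applies them, so torch.no_grad() would only save memory and computation here.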
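
A quick way to convince yourself is to snapshot the weights before and after one step without torch.no_grad(). This is a sketch under assumptions: `backbone` is a small hypothetical stand-in for densenet, and the shapes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the feature extractor, left unfrozen on purpose.
backbone = nn.Sequential(nn.Linear(8, 4), nn.Tanh()).eval()
classifier = nn.Linear(4, 2)
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)

backbone_before = [p.detach().clone() for p in backbone.parameters()]
classifier_before = classifier.weight.detach().clone()

features = backbone(torch.randn(3, 8))  # no torch.no_grad() here
labels = torch.tensor([0, 1, 0])
loss = nn.functional.cross_entropy(classifier(features), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Gradients reached the backbone, but its weights were never touched.
backbone_has_grads = all(p.grad is not None for p in backbone.parameters())
backbone_unchanged = all(
    torch.equal(before, after)
    for before, after in zip(backbone_before, backbone.parameters())
)
classifier_changed = not torch.equal(classifier_before, classifier.weight)
print(backbone_has_grads, backbone_unchanged, classifier_changed)
```

Note that optimizer.zero_grad() only clears the parameters the optimizer holds, so in this setup the backbone's .grad tensors keep accumulating across iterations. That is one more reason to use torch.no_grad() or freeze the extractor.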
