Confused about "set_grad_enabled"

I am confused about the following snippet taken from the tutorials about transfer learning.

    for phase in ['train', 'val']:
        if phase == 'train':
            model.train()  # Set model to training mode
        else:
            model.eval()   # Set model to evaluate mode

        running_loss = 0.0
        running_corrects = 0

        # Iterate over data.
        for inputs, labels in dataloaders[phase]:
            inputs = inputs.to(device)
            labels = labels.to(device)

            # zero the parameter gradients
            optimizer.zero_grad()

            # forward
            # track history if only in train
            with torch.set_grad_enabled(phase == 'train'):
                outputs = model(inputs)
                _, preds = torch.max(outputs, 1)
                loss = criterion(outputs, labels)

                # backward + optimize only if in training phase
                if phase == 'train':
                    loss.backward()
                    optimizer.step()

            # statistics
            running_loss += loss.item() * inputs.size(0)
            running_corrects += torch.sum(preds == labels.data)

        epoch_loss = running_loss / dataset_sizes[phase]
        epoch_acc = running_corrects.double() / dataset_sizes[phase]

  1. What does the part

    with torch.set_grad_enabled(phase == 'train'):

do? I thought that model.eval() and model.train() were enough to put the model into a state in which backprop information is either stored or not.

  2. How does it compare to requires_grad? Do they fulfill the same purpose?

  3. Why is the epoch_loss divided by the size of the data set? According to the docs, the loss is averaged over the batch by default (reduction='mean'), so why would I divide by the entire data set size in the end?

  1. model.train() and model.eval() change the behavior of some layers. E.g. in eval mode nn.Dropout won’t drop activations anymore and nn.BatchNorm layers will use their running estimates instead of the batch statistics. The torch.set_grad_enabled line disables gradient tracking during evaluation, so the intermediate values that are needed to backpropagate during training are not stored, which saves memory. It’s comparable to the with torch.no_grad() statement but takes a bool value.

  2. All new operations in the torch.set_grad_enabled(False) block won’t require gradients. However, the model parameters will still require gradients.

  3. The averaged batch loss is “de-averaged” by multiplying it with inputs.size(0) before it is added to running_loss. Therefore you should divide by the whole dataset length, not the number of batches.
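
To make points 1 and 2 concrete, here is a minimal sketch. The small nn.Sequential model and the random input are made up for illustration and are not the tutorial's model. It checks that model.eval() alone does not stop autograd from tracking the forward pass, while torch.set_grad_enabled(False) does, and that the parameters keep requires_grad=True either way.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical toy model, only used to illustrate the behavior
model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5), nn.Linear(4, 2))
x = torch.randn(3, 4)

# eval() only changes layer behavior (here: dropout is switched off);
# autograd still records the forward pass.
model.eval()
out_eval = model(x)
eval_tracks_grad = out_eval.requires_grad

# set_grad_enabled(False) stops autograd from recording the forward pass.
with torch.set_grad_enabled(False):
    out_no_grad = model(x)
no_grad_tracks_grad = out_no_grad.requires_grad

# The parameters themselves are untouched and still require gradients.
params_still_require_grad = all(p.requires_grad for p in model.parameters())

print(eval_tracks_grad, no_grad_tracks_grad, params_still_require_grad)
```

So the two mechanisms are independent: one controls layer behavior, the other controls whether a graph is built.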


thank you for your reply, I see clearer now.

Regarding question 3:

  1. Why would one de-average the loss only to divide by the entire data set size later? Is it to smooth the results?

  2. Can the de-averaging also be done by omitting the multiplication with inputs.size(0) and setting the reduction parameter to ‘none’ in the loss function initialization?

  1. If your dataset length is not divisible by the batch size, the last batch will contain fewer samples than all others. Thus summing the averaged batch losses and dividing by the number of batches will give a slightly biased result, since the smaller last batch gets the same weight as a full one.

  2. You could use reduction='sum' to get the accumulated loss for debugging purposes. However, I would still recommend using the mean loss for backpropagation, since otherwise the gradient magnitude, and thus e.g. the effective learning rate, will depend on the batch size.
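
A small sketch of both points, using made-up random logits and targets (10 samples, batch size 4, so the last batch only has 2 samples). It accumulates the loss the way the tutorial does (mean loss times batch size) and via reduction='sum', and compares both to the plain mean of the per-batch means.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
outputs = torch.randn(10, 3)           # hypothetical logits for 10 samples
labels = torch.randint(0, 3, (10,))    # hypothetical targets
batch_size = 4                         # gives batches of 4, 4 and 2 samples

mean_criterion = nn.CrossEntropyLoss(reduction='mean')
sum_criterion = nn.CrossEntropyLoss(reduction='sum')

running_weighted = 0.0  # mean loss * batch size, as in the tutorial
running_sum = 0.0       # reduction='sum', no multiplication needed
batch_means = []

for start in range(0, outputs.size(0), batch_size):
    out = outputs[start:start + batch_size]
    lab = labels[start:start + batch_size]
    batch_mean = mean_criterion(out, lab).item()
    running_weighted += batch_mean * out.size(0)
    running_sum += sum_criterion(out, lab).item()
    batch_means.append(batch_mean)

per_sample_avg = running_weighted / outputs.size(0)
per_sample_avg_from_sum = running_sum / outputs.size(0)
mean_of_batch_means = sum(batch_means) / len(batch_means)
true_mean = mean_criterion(outputs, labels).item()

print(per_sample_avg, per_sample_avg_from_sum, mean_of_batch_means, true_mean)
```

The first two values match the mean over all samples (up to floating point error), while mean_of_batch_means generally does not, because the short last batch is over-weighted. Using reduction='none' and calling .sum() on the result accumulates the same value as reduction='sum'.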


thank you very much!

Another thing: according to the docs, scheduler.step() must be called after optimizer.step().
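
As a rough illustration of that ordering, here is a sketch with a toy nn.Linear model, SGD, and StepLR. All of these are picked for the example only and are not taken from the tutorial. The list lrs records the learning rate that is actually used for each update.

```python
import torch
import torch.nn as nn

model = nn.Linear(1, 1)  # hypothetical toy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)

lrs = []
for epoch in range(3):
    # one (hypothetical) training step per epoch
    optimizer.zero_grad()
    loss = model(torch.ones(1, 1)).sum()
    loss.backward()
    lrs.append(optimizer.param_groups[0]['lr'])  # lr used for this update
    optimizer.step()
    scheduler.step()  # called after optimizer.step()

print(lrs)
```

With this order the first update uses the initial learning rate of 0.1, and the decay only kicks in afterwards.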


In your second statement you wrote: “All new operations in the torch.set_grad_enabled(False) block won’t require gradients. However, the model parameters will still require gradients.”
For example, my pretrained model is an “Encoder”, and requires_grad has been set to False for all of its parameters in the init function of my “Feed Forward Classifier”. In my train loop, “with torch.set_grad_enabled(phase == ‘train’):…” is used and phase == “train” is True.
Will my classifier model update the frozen parameters of my Encoder?

No, it won’t.
Setting set_grad_enabled to True allows the gradient calculation, but it will not force Autograd to track operations on parameters whose requires_grad attribute is False:

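import torch
import torch.nn as nn

encoder = nn.Linear(1, 1, bias=False)
encoder.weight.requires_grad_(False)  # freeze the encoder

classifier = nn.Linear(1, 1, bias=False)

x = torch.randn(1, 1)

with torch.set_grad_enabled(True):
    out = encoder(x)
    out = classifier(out)
out.backward()

print(encoder.weight.grad)
> None

print(classifier.weight.grad)
> tensor([[0.3575]])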

Thank you!! So “torch.set_grad_enabled” (a context manager) only decides whether or not gradient tracking is performed within it, and it does not change any parameter’s “requires_grad” value.
With torch.set_grad_enabled(True), gradients will only be calculated for the parameters used in the block that have “requires_grad=True”.

Exactly, it does not switch on the calculation of gradients for parameters that were previously frozen.
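
To also check the original question (whether the frozen encoder gets updated), here is a sketch with hypothetical two-layer stand-ins for the encoder and classifier. It hands all parameters to SGD, including the frozen ones, runs one training step inside set_grad_enabled(True), and compares the weights before and after.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
encoder = nn.Linear(2, 2)      # hypothetical stand-in for the pretrained encoder
for p in encoder.parameters():
    p.requires_grad_(False)    # freeze the encoder
classifier = nn.Linear(2, 1)   # hypothetical stand-in for the classifier head

# Deliberately pass the frozen parameters to the optimizer as well
all_params = list(encoder.parameters()) + list(classifier.parameters())
optimizer = torch.optim.SGD(all_params, lr=0.1)

encoder_before = encoder.weight.clone()
classifier_bias_before = classifier.bias.clone()

with torch.set_grad_enabled(True):
    loss = classifier(encoder(torch.ones(4, 2))).sum()
loss.backward()
optimizer.step()

encoder_grad_is_none = encoder.weight.grad is None
encoder_unchanged = torch.equal(encoder.weight, encoder_before)
classifier_changed = not torch.equal(classifier.bias, classifier_bias_before)

print(encoder_grad_is_none, encoder_unchanged, classifier_changed)
```

The frozen weights never receive a .grad, so the optimizer step skips them, while the classifier is updated as usual.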
