I am confused about the following snippet, taken from the transfer learning tutorial.
for phase in ['train', 'val']:
    if phase == 'train':
        scheduler.step()
        model.train()  # Set model to training mode
    else:
        model.eval()   # Set model to evaluate mode

    running_loss = 0.0
    running_corrects = 0

    # Iterate over data.
    for inputs, labels in dataloaders[phase]:
        inputs = inputs.to(device)
        labels = labels.to(device)

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward
        # track history if only in train
        with torch.set_grad_enabled(phase == 'train'):
            outputs = model(inputs)
            _, preds = torch.max(outputs, 1)
            loss = criterion(outputs, labels)

            # backward + optimize only if in training phase
            if phase == 'train':
                loss.backward()
                optimizer.step()

        # statistics
        running_loss += loss.item() * inputs.size(0)
        running_corrects += torch.sum(preds == labels.data)

    epoch_loss = running_loss / dataset_sizes[phase]
    epoch_acc = running_corrects.double() / dataset_sizes[phase]
- What does the line

      with torch.set_grad_enabled(phase == 'train'):

  do? I thought that model.eval() and model.train() were enough to switch the model between the states in which backprop information is and is not recorded (see the first sketch below).
- How does it compare to requires_grad? Do the two fulfill the same purpose (second sketch below)?
- Why is epoch_loss divided by the size of the dataset? According to the docs the loss is by default averaged over the batch (reduction='mean'), so why would I divide by the entire dataset size at the end (third sketch below)?
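
First sketch: this is the behaviour that confuses me in the first question. It uses a made-up nn.Linear model rather than the tutorial's model, so it is only meant to illustrate what I observe:

import torch
import torch.nn as nn

model = nn.Linear(10, 2)   # hypothetical stand-in for the tutorial's model
x = torch.randn(4, 10)

# eval() only switches layer behaviour (dropout, batchnorm, ...);
# autograd still records the forward pass
model.eval()
out = model(x)
print(out.requires_grad)   # True - a graph is still being built

# the context manager is what actually turns gradient tracking off
with torch.set_grad_enabled(False):
    out = model(x)
print(out.requires_grad)   # False - no history is saved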
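
Second sketch: for the second question, my current understanding is that requires_grad is a per-tensor flag while torch.set_grad_enabled is a context-wide switch, but I am not sure this is right. The tensors here are made up for illustration:

import torch

w = torch.randn(3, requires_grad=True)   # per-tensor flag set at creation
x = torch.randn(3)                       # requires_grad is False by default

y = (w * x).sum()
print(y.requires_grad)                   # True - at least one input requires grad

w.requires_grad_(False)                  # freeze this particular tensor in place
y = (w * x).sum()
print(y.requires_grad)                   # False

w.requires_grad_(True)
with torch.set_grad_enabled(False):      # disables tracking for the whole block,
    y = (w * x).sum()                    # regardless of the per-tensor flags
print(y.requires_grad)                   # False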
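
Third sketch: to make the last question concrete, this is the arithmetic I have in mind, with toy numbers that are not from the tutorial:

# a dataset of 5 samples split into batches of 2, 2 and 1;
# criterion(...) with reduction='mean' returns the per-batch mean loss
batch_means = [2.0, 2.0, 4.0]
batch_sizes = [2, 2, 1]

running_loss = sum(m * n for m, n in zip(batch_means, batch_sizes))  # 2*2 + 2*2 + 4*1 = 12.0
epoch_loss = running_loss / sum(batch_sizes)                         # 12.0 / 5 = 2.4

# simply averaging the three batch means would give (2.0 + 2.0 + 4.0) / 3 ≈ 2.67 instead
print(epoch_loss)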