PyTorch training question

I have been looking at the PyTorch example for training a classifier here:

The training code is as follows:

for epoch in range(2):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:    # print every 2000 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0

print('Finished Training')

This might be a dumb question, but nowhere do we actually check whether the training loss has decreased (except to print the statistics). Is this happening in the background, so that the model is updated only if the training loss is actually going down?

If yes, when are the weights updated? Do they get updated if the training loss decreases for each mini-batch, or if the average error across all the mini-batches goes down (i.e. for the whole epoch)?

So, it seems that the optimizer.step() function is the one that will update the parameters. But looking at how the optimizer is used, I see the declaration:

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

Then at some point during the training we have:

outputs = net(inputs)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()

The optimizer itself does not seem to have any access to the loss function. So how can it know whether a GD step is actually making things better or worse?

It does not know. The loss is backpropagated whatever its value is. You don’t selectively backpropagate depending on whether the loss got better or not; in fact, that would be impossible.

Realize that, in the early stages of training, the network hasn’t seen any data. For the 1st iteration, the loss will decrease (ideally); for the 2nd one, the data may be totally different and thus give a higher loss, but you still need to backpropagate in order to keep learning.

When you backpropagate you compute gradients and error for whatever your loss is.
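
To make that concrete, here is a minimal sketch of the same loop on a toy problem. The `nn.Linear` model, the random mini-batches and the `lr=0.1` value are stand-ins I made up for illustration; they are not from the tutorial. It checks that the optimizer holds the very same parameter tensors as the network (which is how it reaches the `.grad` values written by `loss.backward()`), and that the weights move on every mini-batch whether or not the loss went down.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

net = nn.Linear(4, 3)                      # toy stand-in for the tutorial's network
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.1, momentum=0.9)

# The optimizer stores references to the parameter tensors themselves,
# so it can read p.grad without ever being handed the loss.
same_tensor = optimizer.param_groups[0]['params'][0] is net.weight

losses = []
weight_changed = []
for step in range(5):
    inputs = torch.randn(8, 4)             # a fresh random mini-batch each step
    labels = torch.randint(0, 3, (8,))

    before = net.weight.detach().clone()

    optimizer.zero_grad()                  # clear gradients left from the previous step
    loss = criterion(net(inputs), labels)
    loss.backward()                        # autograd writes d(loss)/d(p) into p.grad
    optimizer.step()                       # uses p.grad only; no loss value is passed in

    losses.append(loss.item())
    weight_changed.append(not torch.equal(before, net.weight.detach()))

print(same_tensor)     # the optimizer and the model share the parameter tensors
print(losses)          # can go up or down from one batch to the next
print(weight_changed)  # the weights are updated on every step regardless
```

So the update happens once per mini-batch, at each `optimizer.step()` call, not once per epoch.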

Thank you for your answer! But how does the optimizer know when the current gradient direction is not optimal (i.e. when to change hyperparameters such as the learning rate) if it has no access to the loss value?

It’s all handled by the autograd machinery.

First of all, stop thinking about optimal directions: the optimizer, like any DL algorithm, is agnostic to optimality.
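
As a side note, if you do want the learning rate to react to the loss, that is the job of a separate learning-rate scheduler, and the loss has to be handed to it explicitly. A rough sketch using `torch.optim.lr_scheduler.ReduceLROnPlateau`; the toy model, the `factor`/`patience` settings and the validation losses below are made up purely for illustration:

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import ReduceLROnPlateau

net = nn.Linear(4, 3)   # toy stand-in model
optimizer = optim.SGD(net.parameters(), lr=0.1, momentum=0.9)

# The scheduler wraps the optimizer; it is the only piece that is shown a loss value.
scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=0)

# Made-up validation losses, one per epoch, purely for illustration.
fake_val_losses = [1.0, 2.0]

lrs = []
for val_loss in fake_val_losses:
    # ... the usual zero_grad / backward / optimizer.step() loop would run here ...
    scheduler.step(val_loss)                      # the loss is passed in explicitly
    lrs.append(optimizer.param_groups[0]['lr'])

print(lrs)   # the rate is halved once the loss stops improving
```

The optimizer itself still never sees the loss; the scheduler just rewrites the `lr` entry in its parameter groups.
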
When you call loss.backward(), gradients are computed for each tensor in the graph following the chain rule and carried back to the leaf tensors. For almost every torch function there exists a backward function that computes its gradients when you call backward().

Each tensor points to the tensors it was computed from, so that the chain rule can be followed. Have a look at this video, in which autograd is explained.
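
A tiny sketch of what that looks like in practice, with a hand-picked function standing in for a real loss (the tensors `x`, `y`, `z` are just illustrative names):

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)  # leaf tensor, like a model parameter
y = 2 * x              # intermediate tensor; records that it came from a multiplication
z = (y ** 2).sum()     # scalar "loss": z = sum((2x)^2) = sum(4 * x^2)

print(x.is_leaf, y.is_leaf)      # x was created by the user, y by an operation
print(z.grad_fn)                 # the backward node for the final sum
print(z.grad_fn.next_functions)  # links to the backward node of the previous operation

z.backward()                     # walk the graph backwards, applying the chain rule
print(x.grad)                    # dz/dx = 8 * x, stored on the leaf tensor
```

After `z.backward()`, only the leaf `x` ends up with a populated `.grad`, which is exactly what an optimizer would read in `step()`.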