Taking a "step" with torch.optim object

This is an excerpt from the classic “Training a classifier” tutorial on PyTorch.org.

The loss function and optimizer are defined as below:

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

They are then used in the training loop to evaluate the loss, compute the gradients, and take an optimization step, like so:

outputs = net(inputs)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()

Forgive the possibly stupid question, but where is the link between the loss function and the optimizer object? All the optimizer needs at initialization seems to be the parameters of net, which simply holds the definitions of the linear layers, activations, softmax, etc. Where do we tell the optimizer that it is the gradient of the loss function w.r.t. these parameters that guides the step? In other words, the parameters are there, but not the function we are taking the gradient of.

The gradients are calculated during the loss.backward() call.
You can try to print the gradients before and after this step using print(net.fc1.weight.grad).
Before the backward pass the gradients will be None; after it you will see some values.
The optimizer does not need to know about the loss: it simply reads the .grad attribute of each parameter it was given and uses it to update that parameter.
You can find more information in this beginner tutorial.
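To make this concrete, here is a minimal sketch of the check suggested above. The tiny one-layer net and the random batch are stand-ins I made up for illustration, not the network from the tutorial; only the criterion and optimizer lines match the original post.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Hypothetical tiny network standing in for `net` from the tutorial.
net = nn.Sequential(nn.Linear(4, 3))
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

inputs = torch.randn(8, 4)          # made-up batch of 8 samples
labels = torch.randint(0, 3, (8,))  # made-up class indices

weight = net[0].weight
grad_before = weight.grad           # None: no backward pass has run yet

optimizer.zero_grad()
outputs = net(inputs)
loss = criterion(outputs, labels)
loss.backward()                     # autograd stores d(loss)/d(weight) in weight.grad

grad_after = weight.grad.clone()
weight_before = weight.detach().clone()
optimizer.step()                    # reads weight.grad and updates weight in place
weight_after = weight.detach().clone()

print(grad_before)                                # None
print(grad_after.shape)                           # same shape as the weight
print(torch.equal(weight_before, weight_after))   # False: the step changed the weight
```

The only thing shared between loss.backward() and optimizer.step() is the parameter tensor itself: the backward pass writes its .grad, and the optimizer reads it.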

Thanks for the info, but I have already gone through the tutorial at a high level. How the optimizer knows is exactly what I am asking about. We have a loss function whose gradients with respect to the variables decide (along with the step size/learning rate) what the next values of the variables should be. It seems odd that the construct that takes the step, the torch.optim instance, knows only about the parameters (it looks that way from the statement “optimizer = …”) and not about the loss function. How does it pick up the gradient values correctly? In other words, what if I had another loss function on the same set of parameters (e.g., loss_2 = nn.NLLLoss()(outputs, labels), where outputs and labels are as above)? How would it know which gradient to use to take the next “step”? One pictures a setup where the forward-backward construct (net), the loss function, and the optimizer have to work in tandem, and I am not seeing it.

The loss function does not know the next values. It just calculates the current gradients for all necessary parameters.
The optimizer uses these gradients of the provided parameters to update the weights.
Some optimizers have a momentum term or other parameters for running averages. Have a look at e.g. Adam.
So indeed the optimizer only needs to know which Parameters to update and how to do it. The needed gradient has to be provided by the backward pass of your loss function.
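As a rough illustration of why the parameters are all the optimizer needs, the sketch below compares optim.SGD with a hand-written update loop. It assumes plain SGD without momentum (unlike the tutorial's settings) so that the update rule is just p = p - lr * p.grad; the layer sizes, data, and learning rate are made up.

```python
import copy

import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
lr = 0.1  # arbitrary learning rate for the demo

net_a = nn.Linear(4, 3)          # updated by the library optimizer
net_b = copy.deepcopy(net_a)     # identical copy, updated by hand

inputs = torch.randn(8, 4)
labels = torch.randint(0, 3, (8,))
criterion = nn.CrossEntropyLoss()

# (a) Library optimizer: it is only ever given the parameters.
optimizer = optim.SGD(net_a.parameters(), lr=lr)
criterion(net_a(inputs), labels).backward()
optimizer.step()

# (b) Hand-written equivalent: read each parameter's .grad and update in place.
criterion(net_b(inputs), labels).backward()
with torch.no_grad():
    for p in net_b.parameters():
        if p.grad is not None:
            p -= lr * p.grad

same = all(
    torch.allclose(pa, pb)
    for pa, pb in zip(net_a.parameters(), net_b.parameters())
)
print(same)
```

Neither version refers to the loss during the update. Whatever was backpropagated last (or accumulated) into .grad is what gets used.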

The gradients are accumulated by default. So in your case, calling backward() on both losses would compute gradients that are summed for each parameter. That is also the reason you have to zero out the gradients before the next backward pass (optimizer.zero_grad() in your training code).
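The accumulation can be checked with a small sketch like the following. The one-layer net and random data are made up, and the second loss applies nn.NLLLoss directly to the raw outputs as in the question (normally it expects log-probabilities), which is enough to give a gradient different from the first loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

net = nn.Linear(4, 3)                # hypothetical stand-in for `net`
inputs = torch.randn(8, 4)
labels = torch.randint(0, 3, (8,))

criterion_1 = nn.CrossEntropyLoss()
criterion_2 = nn.NLLLoss()           # applied to raw outputs, for illustration only

def grad_for(criterion):
    """Gradient of a single loss w.r.t. the weight, starting from cleared grads."""
    net.zero_grad()
    criterion(net(inputs), labels).backward()
    return net.weight.grad.clone()

g1 = grad_for(criterion_1)
g2 = grad_for(criterion_2)

# Now backpropagate both losses without clearing the gradients in between.
net.zero_grad()
outputs = net(inputs)
loss_1 = criterion_1(outputs, labels)
loss_2 = criterion_2(outputs, labels)
loss_1.backward(retain_graph=True)   # keep the graph: loss_2 shares `outputs`
loss_2.backward()                    # adds to the existing .grad values
accumulated = net.weight.grad.clone()

print(torch.allclose(accumulated, g1 + g2, atol=1e-6))
```

A subsequent optimizer.step() would therefore use the sum of both gradients, which is the same as stepping on loss_1 + loss_2.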

Ok, that explains it, thx.