How does the loss function affect the model during training?

The criterion is independent of the model and they “communicate” through the training process, i.e. the criterion calculates the loss, which is then used for the gradient calculation. The optimizer will then update the passed parameters such that the model reduces the loss.
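
A minimal sketch of that interaction, assuming a toy linear model, random data, and plain SGD (none of which come from the thread):

import torch
import torch.nn as nn

model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

inputs = torch.randn(8, 10)            # made-up batch
labels = torch.randint(0, 2, (8,))     # made-up targets

optimizer.zero_grad()                  # clear stale gradients
outputs = model(inputs)                # forward pass
loss = criterion(outputs, labels)      # criterion computes the loss
loss.backward()                        # gradients for all used parameters
optimizer.step()                       # optimizer updates the passed parameters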

Why would the model or optimizer need a handle to the criterion?
The loss.backward() call will calculate the gradients and assign (or accumulate) them to the .grad attributes of all parameters. The loss is connected to the model via the computation graph, and thus the backward pass has access to all used parameters. There is no need to pass the criterion handle around, as it doesn’t hold any state and just provides the loss calculation.
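
To see the assignment and accumulation, a small sketch (the lin layer and data are made up for illustration):

import torch
import torch.nn as nn

lin = nn.Linear(2, 1)
x = torch.randn(4, 2)

print(lin.weight.grad)  # None before the first backward call

lin(x).mean().backward()
first = lin.weight.grad.clone()
print(first)            # gradients were assigned

lin(x).mean().backward()
print(lin.weight.grad)  # gradients were accumulated: now exactly 2 * first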

Yes, this is done via Autograd, which creates a computation graph and assigns valid backward functions to the .grad_fn attribute of activation tensors.

Thanks. Here is the original code laid out…

Which line there performs this “Autograd” piece and creates the bindings
between the “criterion” and/or the loss and the “model”?


PyTorch will track all differentiable operations on tensors requiring gradients:

import torch

x = torch.randn(1, 1, requires_grad=True)

y = x * 2
print(y.grad_fn)
# <MulBackward0 object at 0x7f161e967490>

# loss can be anything
loss = y**2

loss.backward()
print(x.grad)
# tensor([[-8.4696]])

This doc and this tutorial might be good starters.

Sorry to press this issue. If I can humbly ask one more question…

Yes, that makes perfect sense, because your lines “y = x * 2” and “loss = y**2”
associate x with loss.

In the code and tutorial I linked earlier
there is nothing equivalent to that! The linked code uses these two lines…

criterion = nn.CrossEntropyLoss()
loss = criterion(outputs, labels)

Where does the analog to your x get introduced!?

The model’s outputs tensor is attached to a computation graph, since trainable parameters were used to create it. Instead of the x tensor, the .weight and/or .bias of the layers are the tensors requiring gradients, which also creates a computation graph:

import torch
import torch.nn as nn

x = torch.randn(1, 10)
lin = nn.Linear(10, 10)

out = lin(x)
loss = out.mean()

Yes, that’s another one that is readily understood. The loss object is tied to x through the out = lin(x) line.

My example…

criterion = nn.CrossEntropyLoss()

loss = criterion(outputs, labels)

has the loss computed by a generic instance of nn.CrossEntropyLoss, without
any apparent relation to the model or to my equivalent of x, to use your nomenclature.



x won’t receive any gradients in my latest example, and you can verify it by accessing its .grad attribute, which will return None. It’s just the input to the model, and the lin.weight and lin.bias attributes are now the leaf tensors receiving gradients.
The analogy is that a computation graph will be created by applying differentiable operations to trainable parameters. The model’s output corresponds to the y tensor, and lin.weight and lin.bias correspond to x from my first example.
The loss function just applies other differentiable operations (in the same way a linear layer performs a matmul).
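
Continuing that snippet, a quick check (a sketch; the concrete gradient values will differ from run to run):

import torch
import torch.nn as nn

x = torch.randn(1, 10)
lin = nn.Linear(10, 10)

loss = lin(x).mean()
loss.backward()

print(x.grad)           # None: x is a plain input and does not require gradients
print(lin.weight.grad)  # filled with gradients
print(lin.bias.grad)    # filled with gradients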

The bottom line is that all of your examples make sense.
The only example I can’t understand is this one:

criterion = nn.CrossEntropyLoss()

loss = criterion(outputs, labels)

In this case the loss has no connection to any tensor anywhere?!


It does, since outputs was created by the model, as mentioned a few times already and as seen in the linked blog post:

        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)

Thanks. I think I get it now. This seems like an oddity in the way PyTorch does things…

So you’re saying that in “outputs = model(images)” the outputs object will have more
than just the numerical predictions?! outputs will also have a link/handle (whatever you want to call it)
to the model object? I can accept that, but it seems weird. Did I get it right now?


Yes, your explanation is correct. The “link” from the outputs tensor to the model’s parameters is done by Autograd and reflected in the computation graph and the outputs.grad_fn object. A very naive point of view would be to think about Autograd “recording” the operations in the forward pass by creating the computation graph. In the backward pass the grad_fn will be used to backpropagate through the entire graph. Internals might be more complicated (e.g. PyTorch is smart enough to figure out when to stop the backpropagation if no gradients are needed in previous operations).
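
To make the link visible, a short sketch (toy model; the exact grad_fn type depends on the layers used):

import torch
import torch.nn as nn

model = nn.Linear(10, 2)
images = torch.randn(4, 10)

outputs = model(images)
print(outputs.grad_fn)
# e.g. <AddmmBackward0 object at 0x...>; this node hooks outputs into the graph

# the graph can be walked backwards towards the parameters
print(outputs.grad_fn.next_functions)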