How does optimizer.step() work over a batch and multiple output neurons?

Hi, beginner with PyTorch here. I don't quite understand how backpropagation with loss.backward() and optimizer.step() works in my code. Can someone help me clarify this? My question has 2 parts:

  1. I have a neural network that I’m training that has 3 output neurons. So, when I feed an input into my model, I get a prediction of size 3x1. I then compare that to my actual (target) values of size 3x1 to get a loss:

```python
loss = torch.nn.functional.mse_loss(predicted, actual)
self.optimizer.zero_grad()   # clear gradients left over from the previous step
loss.backward()              # compute gradients of loss w.r.t. the model parameters
self.optimizer.step()        # update the parameters using those gradients
```

loss here is a single number (a scalar). Is loss supposed to be a single number here, even though the output layer has size 3? My understanding is that since loss is an MSE over the 3 actual - predicted differences (i.e. it averages over them), it contains the information needed to calculate gradients & backpropagate through all 3 output neurons. Please correct me on anything I’m wrong about here.

  2. In reality I’m computing batches of 32 at a time, such that my predicted and actual are 32x3 in size. Using the lines above, loss is still a single scalar, which I use to update my model at the end of the batch. Is this the correct way to backpropagate when doing mini-batch training like I’m doing? My loss here is a single number that averages over the 3 output neurons and across the 32 samples in the batch. Or should I be getting 1 loss per sample instead? (See the sketches below for what I mean.)
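
For reference, here's a minimal standalone sketch of what I mean in part 1 (the numbers and shapes are made up, not from my real model):

```python
import torch
import torch.nn.functional as F

# Made-up prediction/target with 3 output values (stand-ins for my real model's output)
predicted = torch.tensor([0.2, 0.7, 1.5], requires_grad=True)
actual = torch.tensor([0.0, 1.0, 1.0])

loss = F.mse_loss(predicted, actual)  # default reduction='mean': one scalar over the 3 elements
loss.backward()

print(loss)            # a single number
print(predicted.grad)  # still one gradient per output element, i.e. 2 * (predicted - actual) / 3
```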
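
And here's the batched version from part 2, showing the default reduction versus keeping one loss per sample (random numbers, just for illustration):

```python
import torch
import torch.nn.functional as F

predicted = torch.randn(32, 3, requires_grad=True)  # stand-in for a batch of model outputs
actual = torch.randn(32, 3)

# Default reduction='mean': a single scalar averaged over all 32 * 3 elements
loss = F.mse_loss(predicted, actual)
print(loss.shape)  # torch.Size([]) -- a 0-d scalar

# reduction='none' keeps the element-wise losses, so I can get one loss per sample
per_element = F.mse_loss(predicted, actual, reduction='none')  # shape (32, 3)
per_sample = per_element.mean(dim=1)                           # shape (32,)

# Since every sample has the same number of elements, averaging the per-sample
# losses gives back the same scalar as the default reduction
print(torch.allclose(per_sample.mean(), loss))  # True
```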

Sorry if this question doesn’t make a whole lot of sense. Currently very confused :)

This sounds fine to me. The idea is that when you backpropagate through the loss function, each output element (i.e. each input to the loss) receives a gradient proportional to the derivative of its own contribution to the loss.
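
You can check that directly with a tiny sketch (assuming the default reduction='mean'; the shapes are just illustrative): each element of predicted ends up with the gradient of its own squared-error term, scaled by 1/N from the mean.

```python
import torch
import torch.nn.functional as F

predicted = torch.randn(4, 3, requires_grad=True)
actual = torch.randn(4, 3)

loss = F.mse_loss(predicted, actual)  # (1/N) * sum((predicted - actual)**2), with N = 12
loss.backward()

# Each element's gradient is the derivative of its own contribution to the loss:
# d/dp_ij [(1/N) * (p_ij - a_ij)**2] = 2 * (p_ij - a_ij) / N
manual = 2 * (predicted.detach() - actual) / predicted.numel()
print(torch.allclose(predicted.grad, manual))  # True
```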