Does Loss.backward() calculate gradients over the mini-batch?

I’m training a small many-to-one LSTM network to generate characters one at a time.
My training data are split into mini-batches, each with the shape
[batch size, sequence length, number of features]
and batch_first=True in my LSTM unit.
After feeding a mini-batch forward through the network and computing CrossEntropyLoss(), I call
loss.backward() just once, so the gradients get calculated and stored in the .grad attribute.

My question is:
with a batch size of 16, shouldn’t the gradient be calculated 16 times (once after each sequence in the mini-batch), accumulated, and then averaged before the parameters are updated with optimiser.step()?

Does loss.backward() calculate the gradients for each sequence in the mini-batch, or just once after the last sequence? If the latter, how can I calculate the gradient after each sequence and then update the weights with the averaged gradients at the end?

Here is my training loop:

for epoch in range(num_epochs):
  for startidx in range(0, num_batches, batch_size):
    endidx = startidx + batch_size
    step = startidx // batch_size
    xbatch = xhot_seq[startidx:endidx]
    ybatch = yhot_seq[startidx:endidx]

    # Initialise hidden state
    hidden = model.init_hidden()

    # Clear gradients stored from the previous step
    optimiser.zero_grad()

    # Forward pass
    y_pred, hidden = model(xbatch, hidden)
    target = torch.argmax(ybatch.long(), dim=1)
    loss = loss_fn(y_pred, target)
    loss_hist[epoch] = loss.item()

    # Backward pass
    loss.backward()

    # Update parameters
    optimiser.step()


Given that your loss is the mean of the per-sample losses, the computed gradient will be the mean of the per-sample gradients.
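A minimal sketch illustrating this (a toy linear model rather than an LSTM; all names here are illustrative): one backward() call on the mean loss of a batch of 16 produces the same gradient as 16 per-sample backward() calls accumulated in .grad and then divided by 16.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 2)
x = torch.randn(16, 4)                 # batch of 16 samples
y = torch.randint(0, 2, (16,))
loss_fn = torch.nn.CrossEntropyLoss()  # default reduction='mean'

# One backward pass on the mean loss of the whole batch
model.zero_grad()
loss_fn(model(x), y).backward()
batch_grad = model.weight.grad.clone()

# Per-sample backward passes: gradients accumulate in .grad, then average
model.zero_grad()
for i in range(16):
    loss_fn(model(x[i:i+1]), y[i:i+1]).backward()
per_sample_grad = model.weight.grad / 16

print(torch.allclose(batch_grad, per_sample_grad, atol=1e-6))
```

So the single backward() call already gives you the averaged gradient; there is no need to loop over the samples yourself.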


Thanks a lot …

I am still confused: does loss.backward() calculate the gradients many times over the samples in the mini-batch, or just once? And if I only want to update the weights according to the gradient for a subset of the samples in the batch, what can I do? Thanks.


.backward() is not aware of the concept of “samples”, so it will just compute the gradients for whatever is given to it.
It just happens that, in general, what we give it is the mean of the losses of the samples, and so what it computes is the mean of the per-sample gradients.

If your loss contains only the mean of the losses for a given set of samples, then you will get the average gradient from these samples.
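A sketch of that idea (toy model, illustrative names): keep per-sample losses with reduction='none', then call backward() on the mean over a chosen subset. Because the model here has no cross-sample layers (e.g. no batch norm), this matches the gradient you would get by forwarding only the subset.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 2)
x = torch.randn(8, 4)
y = torch.randint(0, 2, (8,))

# reduction='none' keeps one loss value per sample
losses = torch.nn.functional.cross_entropy(model(x), y, reduction='none')
subset = torch.tensor([0, 2, 5])       # illustrative subset of sample indices

model.zero_grad()
losses[subset].mean().backward()       # average gradient over the subset only
subset_grad = model.weight.grad.clone()

# Same gradient as forwarding only those samples
model.zero_grad()
torch.nn.functional.cross_entropy(model(x[subset]), y[subset]).backward()
print(torch.allclose(subset_grad, model.weight.grad, atol=1e-6))
```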

Thanks so much!

But if I forward the whole batch and backward using only the loss of a subset, is that also OK? Are the weights then updated according to the subset of samples? In other words, is the gradient the average gradient over the subset of samples in this batch?

And what about reweighting the loss? Will .backward() compute the average gradient of the reweighted samples?

The backward pass computes the gradients of whatever loss you call it on. So whatever your loss is, you will get the gradient for that loss.
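For the reweighting case, a minimal sketch (illustrative names): multiply the per-sample losses by per-sample weights before reducing, and backward() gives the weighted-average gradient.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 2)
x = torch.randn(8, 4)
y = torch.randint(0, 2, (8,))
weights = torch.rand(8)  # illustrative per-sample importance weights

# Per-sample losses, reweighted before reduction
losses = torch.nn.functional.cross_entropy(model(x), y, reduction='none')
model.zero_grad()
(weights * losses).mean().backward()  # weighted-average gradient
```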


OK, really, thank you! 🙂