Should backward() be called in the epoch loop or the batch loop?

When training neural network models in PyTorch, does it make a difference where we call the backward() method? For example, which of the two snippets below is correct?

Calculate the gradient for each batch:

for e in range(epochs):
    loss_sum = 0.0                            # running loss for this epoch
    for i in batches_list:
        out = nn_model(i)
        loss = loss_function(out, actual)     # `actual`: targets for batch `i`
        loss_sum += loss.item()
        nn_model.zero_grad()                  # clear gradients from the previous batch
        loss.backward()                       # gradients for this batch only
        optimizer.step()                      # update after every batch
    loss_list.append(loss_sum / num_train_obs)

Calculate the gradient across the whole epoch:

for e in range(epochs):
    loss_sum = 0.0
    for i in batches_list:
        out = nn_model(i)
        loss = loss_function(out, actual)
        loss_sum += loss.item()
    nn_model.zero_grad()
    loss_sum.backward()
    optimizer.step()                          # one update per epoch
    loss_list.append(loss_sum / num_train_obs)

The former is correct. The latter will compute the gradient for the last minibatch only.
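
A minimal sketch of that point (the toy linear model, random data, and batch layout here are my own assumptions, not code from this thread): when backward() runs only once after the inner loop, on the last computed loss, only the final batch contributes to the gradients.

import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
loss_fn = torch.nn.MSELoss()
batches = [torch.randn(8, 4) for _ in range(3)]
targets = [torch.randn(8, 1) for _ in range(3)]

# backward() inside the batch loop: every batch contributes to .grad
model.zero_grad()
for x, y in zip(batches, targets):
    loss_fn(model(x), y).backward()
grad_every_batch = model.weight.grad.clone()

# backward() once after the loop, on the last computed loss:
# only the final batch contributes
model.zero_grad()
for x, y in zip(batches, targets):
    loss = loss_fn(model(x), y)
loss.backward()
grad_last_batch = model.weight.grad.clone()

print(torch.allclose(grad_every_batch, grad_last_batch))  # False in general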

Thanks @crowsonkb. I think I had a typo in the second snippet: backward() should be called on loss_sum, not on loss. In that case, would your point still hold?

To accumulate gradients across an entire epoch, try something along the lines of the following code:

for e in range(epochs):
    optimizer.zero_grad()                     # clear gradients once per epoch
    loss_sum = 0.0
    for i in batches_list:
        out = nn_model(i)
        loss = loss_function(out, actual)
        loss.backward()                       # gradients accumulate in .grad across batches
        loss_sum += loss.item()
    optimizer.step()                          # single update with the accumulated gradients
    loss_list.append(loss_sum / num_train_obs)

I guess your pseudocode and my second snippet are both correct in terms of accumulating gradients across the entire epoch? @crowsonkb

I think your second snippet would need loss_sum += loss instead of loss_sum += loss.item() to work at all: .item() returns a plain Python float, which carries no computation graph and has no backward() method. I also think mine may be more memory-efficient, because calling loss.backward() at the end of each batch lets that batch's intermediate computation graph be freed right away, instead of keeping every batch's graph alive until a single backward() at the end of the epoch. Though someone with more experience with autograd might want to chime in here.
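
For what it's worth, a quick sketch of the equivalence (again, the toy model and random data are assumptions for illustration): keeping the per-batch losses as tensors and calling backward() once on their sum accumulates the same gradients as calling backward() per batch, whereas a sum of loss.item() values is just a Python float with nothing to backpropagate.

import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
loss_fn = torch.nn.MSELoss()
batches = [torch.randn(8, 4) for _ in range(3)]
targets = [torch.randn(8, 1) for _ in range(3)]

# Option A: backward() per batch; each batch's graph is freed right after its call
model.zero_grad()
for x, y in zip(batches, targets):
    loss_fn(model(x), y).backward()
grad_per_batch = model.weight.grad.clone()

# Option B: sum the loss *tensors* and backprop once; every batch's graph stays
# in memory until this single backward() call
model.zero_grad()
loss_sum = sum(loss_fn(model(x), y) for x, y in zip(batches, targets))
loss_sum.backward()
grad_summed = model.weight.grad.clone()

print(torch.allclose(grad_per_batch, grad_summed))  # True: same accumulated gradients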

Your explanation is correct and @albanD posted some examples in this post a while ago.
