Pytorch gradient accumulation

This article https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255 suggests doing gradient accumulation in the following way:

model.zero_grad()                                   # Reset gradients tensors
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                     # Forward pass
    loss = loss_function(predictions, labels)       # Compute loss function
    loss = loss / accumulation_steps                # Normalize our loss (if averaged)
    loss.backward()                                 # Backward pass
    if (i+1) % accumulation_steps == 0:             # Wait for several backward steps
        optimizer.step()                            # Now we can do an optimizer step
        model.zero_grad()                           # Reset gradients tensors

I have two doubts.

  1. What is the need for the step `loss = loss / accumulation_steps`?
  2. Suppose there are 960 training instances and memory can accommodate a maximum of 64, so the number of batches is 15. If I choose accumulation_steps=2, the parameters are not updated for the last batch. Doesn't this affect the performance of the model?
  1. Loss gradients are added (accumulated) across calls to `loss.backward()`, and `loss / accumulation_steps` divides the loss in advance so that the accumulated gradient equals the gradient of the *average* loss over the accumulated batches.

  2. First, because batches that aren't applied in a step are wasted, you should make sure the number of batches is divisible by accumulation_steps. Second, with the `(i + 1) % accumulation_steps == 0` check, the gradients of your last (15th) batch are accumulated but never applied, since the step only fires when `i + 1` is even. I think `(i + 1)` should be `i` because of this: `i % accumulation_steps == 0` lets the last batch trigger a step, at the cost of the very first step being taken after only one batch.
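To see why dividing by accumulation_steps gives the right gradient, here is a toy sketch in plain Python (no torch; the per-sample loss `(w - x)^2` and its hand-written gradient are my own illustration, not from the article). Accumulating the gradients of `loss_i / N` over N small steps reproduces the gradient of the mean loss over the whole big batch:

```python
# Toy demonstration: accumulating gradients of (loss / accumulation_steps)
# equals the gradient of the averaged big-batch loss.

def grad_loss(w, x):
    """Gradient of the per-sample loss (w - x)^2 with respect to w."""
    return 2.0 * (w - x)

w = 0.5
samples = [1.0, 2.0, 3.0, 4.0]
accumulation_steps = len(samples)

# Gradient of the averaged "big batch" loss: mean_i (w - x_i)^2
big_batch_grad = sum(grad_loss(w, x) for x in samples) / len(samples)

# Gradient accumulated over small steps, each using loss_i / accumulation_steps
accumulated_grad = 0.0
for x in samples:
    accumulated_grad += grad_loss(w, x) / accumulation_steps

print(big_batch_grad, accumulated_grad)  # the two values match
```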
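And here is a quick check (plain Python, just simulating the loop indices) of which iterations trigger a step with 15 batches and accumulation_steps = 2, plus one common workaround I've seen: an extra optimizer step after the loop whenever leftover gradients remain. The workaround is my own suggestion, not from the article:

```python
# Simulate which of the 15 iterations (i = 0..14) trigger optimizer.step()
# under the (i + 1) % accumulation_steps == 0 check.

num_batches = 15
accumulation_steps = 2

step_iterations = [i for i in range(num_batches)
                   if (i + 1) % accumulation_steps == 0]
print(step_iterations)  # steps fire at i = 1, 3, 5, ..., 13 only

# The last batch (i = 14) is accumulated but never applied. A common fix
# is one extra optimizer.step() after the loop when gradients are left over:
leftover = num_batches % accumulation_steps != 0
if leftover:
    step_iterations.append(num_batches - 1)  # final step applies batch 14 too
print(step_iterations)  # now every batch contributes to some update
```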