This article https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255 suggest to do gradient accumulation in the following way
model.zero_grad() # Reset gradients tensors for i, (inputs, labels) in enumerate(training_set): predictions = model(inputs) # Forward pass loss = loss_function(predictions, labels) # Compute loss function loss = loss / accumulation_steps # Normalize our loss (if averaged) loss.backward() # Backward pass if (i+1) % accumulation_steps == 0: # Wait for several backward steps optimizer.step() # Now we can do an optimizer step model.zero_grad() # Reset gradients tensors
I have two doubts.
- What is the need of this step , loss=loss/accumulation_steps?
- Suppose, there are 960 training instances and the memory can accommodate a maximum of 64. So, number of batches =15. So, if i choose accumulation_steps=2, the parameters are not updated for the last batch. Doesn’t it affect the performance of the model?