This article https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255 suggests doing gradient accumulation in the following way:
```python
model.zero_grad()                                   # Reset gradients tensors
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)                     # Forward pass
    loss = loss_function(predictions, labels)       # Compute loss function
    loss = loss / accumulation_steps                # Normalize our loss (if averaged)
    loss.backward()                                 # Backward pass
    if (i + 1) % accumulation_steps == 0:           # Wait for several backward steps
        optimizer.step()                            # Now we can do an optimizer step
        model.zero_grad()                           # Reset gradients tensors
```
I have two doubts:

- What is the need for this step, `loss = loss / accumulation_steps`? (I put a small check of my current understanding right after this list.)
- Suppose there are 960 training instances and memory can accommodate at most 64 at a time, so the number of batches is 15. If I choose accumulation_steps = 2, the parameters are never updated for the last (15th) batch, because 15 is odd and the condition `(i + 1) % accumulation_steps == 0` is never met for it. Doesn't this affect the performance of the model? (See the second sketch below for the workaround I have in mind.)
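For the first doubt, here is how I currently understand that line (this toy check is my own, not from the article): if the loss is a mean over the micro-batch, dividing it by accumulation_steps before calling backward() should make the accumulated gradient equal to the gradient of one big averaged batch. Is that the only purpose of that step?

```python
# Toy check of my understanding (my own example, not from the article):
# with a mean-reduced loss, dividing each micro-batch loss by
# accumulation_steps should make the accumulated gradient match the
# gradient computed on the full batch in one go.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
loss_function = nn.MSELoss()          # 'mean' reduction by default
inputs, labels = torch.randn(64, 10), torch.randn(64, 1)
accumulation_steps = 2

# Gradient from the full batch of 64
model.zero_grad()
loss_function(model(inputs), labels).backward()
full_grad = model.weight.grad.clone()

# Accumulated gradient from 2 micro-batches of 32, with the division
model.zero_grad()
for x_chunk, y_chunk in zip(inputs.chunk(accumulation_steps),
                            labels.chunk(accumulation_steps)):
    loss = loss_function(model(x_chunk), y_chunk) / accumulation_steps
    loss.backward()                   # gradients add up across micro-batches
accumulated_grad = model.weight.grad.clone()

print(torch.allclose(full_grad, accumulated_grad, atol=1e-6))  # prints True
```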
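For the second doubt, the workaround I was thinking of (my own addition, not from the article, reusing the same undefined names as the snippet above) is to flush any leftover accumulated gradients with one extra optimizer step at the end of the epoch, so the 15th batch in my example still contributes to an update:

```python
# My own workaround idea (not from the article), using the same undefined
# names as the snippet above: after the loop, do one extra optimizer step
# if some gradients are still waiting to be applied, so the last batch
# (batch 15 when accumulation_steps = 2 and there are 15 batches) is not lost.
model.zero_grad()
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)
    loss = loss_function(predictions, labels) / accumulation_steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        model.zero_grad()

# Flush leftover gradients (15 % 2 == 1 in my example)
if len(training_set) % accumulation_steps != 0:
    optimizer.step()
    model.zero_grad()
```

Is that a reasonable way to handle it, or is it better to just drop the leftover gradients?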