I tried to use gradient accumulation in my project. To my understanding, accumulating gradients over x steps should be equivalent to increasing the batch size by x times. I compared batch_size=32 against batch_size=8 with gradient_accumulation=4 in my project, but the results differ even with shuffling disabled in the DataLoader, and the batch_size=8, accumulation=4 variant is significantly worse. Why would that be?
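To make the equivalence I have in mind concrete, here is a minimal self-contained check (a toy nn.Linear and MSELoss as stand-ins for my actual model). Note that the gradients only match when each micro-batch loss is divided by the number of accumulation steps, since a mean-reduced loss already averages within its own micro-batch:

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
criterion = nn.MSELoss()  # mean reduction
data, target = torch.randn(32, 10), torch.randn(32, 1)

# One backward pass with the full batch of 32
model.zero_grad()
criterion(model(data), target).backward()
full_grad = model.weight.grad.clone()

# Four accumulated micro-batches of 8, each loss scaled by 1/4
model.zero_grad()
for x_chunk, y_chunk in zip(data.chunk(4), target.chunk(4)):
    (criterion(model(x_chunk), y_chunk) / 4).backward()
acc_grad = model.weight.grad.clone()

print(torch.allclose(full_grad, acc_grad, atol=1e-6))  # prints True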
Here is my snippet:
loss = model(x)
epoch_loss += float(loss)  # detached copy, for logging only
loss.backward()            # gradients accumulate until zero_grad()

# step starts from 1; also flush on the last (possibly partial) group
if (step % accumulate_step == 0) or (step == len(dataloader)):
    if clip_grad_norm > 0:
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=clip_grad_norm)
    optimizer.step()
    if scheduler:
        scheduler.step()
    optimizer.zero_grad()
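One thing I noticed while writing this up: most accumulation examples I have seen scale the loss before backward(), so that the accumulated micro-batch gradients average instead of sum; without that, the effective gradient (and hence the effective learning rate) should be roughly accumulate_step times larger. I am not sure this fully explains the gap, but the change would look like this (same names as in my snippet):

loss = model(x)
epoch_loss += float(loss)            # still log the unscaled loss
(loss / accumulate_step).backward()  # accumulated grads now average out

Is the missing scaling the likely cause, or is there something else wrong with my loop?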