I tried to use gradient accumulation in my project. To my understanding, accumulating gradients over x steps should be equivalent to increasing the batch size by x times. I compared batch_size=32 against batch_size=8 with gradient_accumulation=4 in my project, but the results differ even with shuffling disabled in the DataLoader, and the batch_size=8, accumulation=4 variant is significantly worse. Why would that be?
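To make the equivalence I have in mind concrete, here is a minimal self-contained check (a toy nn.Linear and MSELoss as stand-ins for my actual model). Note that the gradients only match when each micro-batch loss is divided by the number of accumulation steps, since a mean-reduced loss already averages within its own micro-batch:

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)
criterion = nn.MSELoss()  # mean reduction
data, target = torch.randn(32, 10), torch.randn(32, 1)

# One backward pass with the full batch of 32
model.zero_grad()
criterion(model(data), target).backward()
full_grad = model.weight.grad.clone()

# Four accumulated micro-batches of 8, each loss scaled by 1/4
model.zero_grad()
for x_chunk, y_chunk in zip(data.chunk(4), target.chunk(4)):
    (criterion(model(x_chunk), y_chunk) / 4).backward()
acc_grad = model.weight.grad.clone()

print(torch.allclose(full_grad, acc_grad, atol=1e-6))  # prints True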
Here is my snippet:
loss = model(x)
epoch_loss += float(loss)  # detached copy, for logging only
loss.backward()            # gradients accumulate until zero_grad()

# step starts from 1; also flush on the last (possibly partial) group
if (step % accumulate_step == 0) or (step == len(dataloader)):
    if clip_grad_norm > 0:
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=clip_grad_norm)
    optimizer.step()
    if scheduler:
        scheduler.step()
    optimizer.zero_grad()
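One thing I noticed while writing this up: most accumulation examples I have seen scale the loss before backward(), so that the accumulated micro-batch gradients average instead of sum; without that, the effective gradient (and hence the effective learning rate) should be roughly accumulate_step times larger. I am not sure this fully explains the gap, but the change would look like this (same names as in my snippet):

loss = model(x)
epoch_loss += float(loss)            # still log the unscaled loss
(loss / accumulate_step).backward()  # accumulated grads now average out

Is the missing scaling the likely cause, or is there something else wrong with my loop?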