Gradient accumulation

Let’s say I want to experiment with a large batch size that does not fit in memory.
Would gradient accumulation be equivalent to running the model with that large batch size?

For example, say the batch size is 10 and I have the following pseudo code:

for i, batch in enumerate(train_loader):
      # scale the loss so the accumulated gradient matches the large-batch mean
      loss = compute_loss(batch) / 10
      loss.backward()

      # step every 10 mini-batches of size 10 -> effective batch size 100
      if (i + 1) % 10 == 0:
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()

Would this be equivalent to just running

for i, batch in enumerate(train_loader):
      loss = compute_loss(batch)
      loss.backward()
      optimizer.step()
      scheduler.step()
      optimizer.zero_grad()

with a batch size of 100?

It would be equivalent if you are not using batch-size-dependent layers such as batchnorm layers.
These layers update their internal running stats using the stats of the current input batch and the momentum term, and those stats would differ between the smaller and larger batch sizes.
Besides that, random operations such as dropout could also change the final results, since the two approaches would draw different random masks and thus drop different activations.
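
As a sanity check, here is a minimal sketch (the toy linear model, tensor shapes, and the division of the loss by the number of accumulation steps are my own assumptions, not code from the question) comparing the accumulated gradient against the large-batch gradient for a model without batchnorm or dropout:

import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy model without batchnorm/dropout, so the two approaches should match.
model = nn.Linear(5, 1)
data = torch.randn(100, 5)
target = torch.randn(100, 1)
criterion = nn.MSELoss()

# (1) One large batch of 100 samples.
model.zero_grad()
criterion(model(data), target).backward()
grad_large = model.weight.grad.clone()

# (2) Ten accumulated mini-batches of 10 samples each.
# Dividing each loss by the number of accumulation steps keeps the
# accumulated gradient equal to the mean over the full 100 samples.
model.zero_grad()
for chunk_data, chunk_target in zip(data.chunk(10), target.chunk(10)):
    loss = criterion(model(chunk_data), chunk_target) / 10
    loss.backward()
grad_accum = model.weight.grad.clone()

print(torch.allclose(grad_large, grad_accum, atol=1e-6))  # True (up to float error)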

@ptrblck you are right. I was using gradient accumulation with filter response normalization and a batch size of 4, and it didn’t perform well. I then used the same model with a batch size of 8 and without gradient accumulation, and it worked fine.