I tried to use gradient accumulation in my project. To my understanding, accumulating gradients over x steps should be equivalent to increasing the batch size by a factor of x. I compared `batch_size==32` against `batch_size==8, gradient_accumulation==4`, but the results differ even with `shuffle` disabled in the dataloader, and the `batch_size==8, accumulation==4` variant is significantly worse.

Why might that be?

Here is my snippet:

```
loss = model(x)
epoch_loss += float(loss)
loss.backward()
# step starts from 1
if (step % accumulate_step == 0) or (step == len(dataloader)):
    if clip_grad_norm > 0:
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=clip_grad_norm)
    optimizer.step()
    if scheduler:
        scheduler.step()
    optimizer.zero_grad()
```
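For reference, here is a framework-free numeric sketch of the equivalence I am assuming (all names in it are mine, not from the training code above). It uses a scalar model `y = w * x` with a mean squared-error loss, and compares the gradient of one full batch against the sum of per-micro-batch gradients, which is what repeated `loss.backward()` calls produce:

```python
# Illustrative sketch only: grad_mean_loss, grad_full, grad_accum etc. are
# hypothetical names, not part of the original training loop.

def grad_mean_loss(w, xs, ys):
    """d/dw of mean((w*x - y)^2) over the given batch."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]

# One big batch of 8 samples.
grad_full = grad_mean_loss(w, xs, ys)

# Two micro-batches of 4 samples; gradients are summed, as repeated
# backward() calls without zero_grad() would do.
accumulate_step = 2
micro = [(xs[:4], ys[:4]), (xs[4:], ys[4:])]
grad_accum = sum(grad_mean_loss(w, mx, my) for mx, my in micro)

# Summing mean-per-micro-batch gradients overshoots the full-batch
# gradient by exactly accumulate_step.
assert abs(grad_accum - accumulate_step * grad_full) < 1e-12

# Dividing by accumulate_step (i.e. scaling each micro-batch loss down
# before backward) restores the equality.
grad_scaled = grad_accum / accumulate_step
assert abs(grad_scaled - grad_full) < 1e-12
print(grad_full, grad_accum, grad_scaled)
```

So the equivalence to a larger batch only holds when the accumulated gradient is rescaled by the number of accumulation steps; without that, it behaves like a larger effective step.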