Accumulate the gradients

Hello. Recently I have noticed how important batch size is when optimizing models.
So I tried to push past the CUDA memory limit,
following the kind explanation in this post.

In my case, I can train with a batch size of 16, but not with 32.
So I accumulate gradients by calling backward() every time and only call
optimizer.step() on even-numbered epochs.
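
For reference, here is a minimal sketch of the accumulation pattern described above. It is not my actual training code: the toy model, the SGD optimizer, the helper name train_with_accumulation and the accum_steps parameter are all placeholders chosen for illustration, and it accumulates over iterations (batches) rather than epochs.

```python
import torch
from torch import nn


def train_with_accumulation(model, optimizer, loss_fn, batches, accum_steps=2):
    """Hypothetical helper: step the optimizer once every `accum_steps` batches."""
    model.train()
    optimizer.zero_grad()
    num_updates = 0
    for i, (inputs, targets) in enumerate(batches, start=1):
        loss = loss_fn(model(inputs), targets)
        # Scale the loss so the accumulated gradient matches the mean
        # over the larger effective batch.
        (loss / accum_steps).backward()
        if i % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
            num_updates += 1
    # Note: a trailing partial group of batches is not stepped in this sketch.
    return num_updates


# Toy data (assumption): 4 batches of 16 samples, so the effective batch is 32.
torch.manual_seed(0)
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
batches = [(torch.randn(16, 4), torch.randint(0, 2, (16,))) for _ in range(4)]
num_updates = train_with_accumulation(
    model, optimizer, nn.CrossEntropyLoss(), batches, accum_steps=2
)
print(num_updates)
```

With a plain first-order optimizer, each backward() frees its graph, so only the gradients are carried between batches.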

But I encountered the following error:
  File "", line 1065, in
  File "", line 641, in main
  File "", line 848, in train_epoch
  File "/home//anaconda3/envs/tresnet/lib/python3.6/site-packages/torch/autograd/", line 49, in decorate_no_grad
    return func(*args, **kwargs)
  File "/home//", line 115, in step
  File "/home//anaconda3/envs/tresnet/lib/python3.6/site-packages/torch/autograd/", line 49, in decorate_no_grad
    return func(*args, **kwargs)
  File "/home//", line 98, in set_hessian
    grads, params, grad_outputs=zs, only_inputs=True, retain_graph=i < self.n_samples - 1)
  File "/home/**/anaconda3/envs/resnet/lib/python3.6/site-packages/torch/autograd/", line 157, in grad
    inputs, allow_unused)
RuntimeError: CUDA out of memory. Tried to allocate 26.00 MiB (GPU 0; 10.76 GiB total capacity; 9.38 GiB already allocated; 16.56 MiB free; 9.77 GiB reserved in total by PyTorch)

The error above occurs on the first call to optimizer.step().
So I tried a batch size of 8, calling optimizer.step() on multiples of 4 epochs.
But nothing changed, even when I decreased the batch size to 1.

I wonder what has already allocated the 9.38 GiB.
Or I wonder whether this is a problem with second-order optimization.
Could anybody help me?

You are using option 2 from there, right?
Do you see the memory usage increase at each iteration?
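
One way to check this is to log the allocator statistics after every iteration. The following is only a sketch: the toy model, the loop and the helper name cuda_memory_mib are assumptions for illustration, while torch.cuda.memory_allocated() and torch.cuda.memory_reserved() are the standard PyTorch calls.

```python
import torch


def cuda_memory_mib():
    """Hypothetical helper: (allocated, reserved) CUDA memory in MiB, zeros without a GPU."""
    if not torch.cuda.is_available():
        return 0.0, 0.0
    mib = 1024 ** 2
    return torch.cuda.memory_allocated() / mib, torch.cuda.memory_reserved() / mib


device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(128, 10).to(device)  # stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

history = []
for step in range(3):
    x = torch.randn(8, 128, device=device)
    loss = model(x).sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    allocated, reserved = cuda_memory_mib()
    history.append(allocated)
    print(f"step {step}: allocated={allocated:.1f} MiB, reserved={reserved:.1f} MiB")
```

If the allocated number keeps growing from one iteration to the next, something (for example a retained graph or a stored loss tensor) is being kept alive across iterations.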

Also, depending on which optimizer you use, it might need to allocate a fairly large state. You need to make sure that a single step plus the optimizer works fine before accumulating over more steps.
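
To get a rough idea of how large the optimizer state is, you can take a single step and sum up the tensors stored in optimizer.state. This sketch uses torch.optim.Adam on a tiny linear layer purely as a stand-in (your second-order optimizer will keep different buffers), and optimizer_state_bytes is a hypothetical helper name.

```python
import torch


def optimizer_state_bytes(optimizer):
    """Hypothetical helper: total bytes of all tensors held in the optimizer state."""
    total = 0
    for state in optimizer.state.values():
        for value in state.values():
            if torch.is_tensor(value):
                total += value.numel() * value.element_size()
    return total


model = torch.nn.Linear(100, 10)                   # stand-in model
optimizer = torch.optim.Adam(model.parameters())   # stand-in optimizer

# The state is created lazily, so run exactly one forward/backward/step first.
model(torch.randn(1, 100)).sum().backward()
optimizer.step()

param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
state_bytes = optimizer_state_bytes(optimizer)
print(f"parameters: {param_bytes} bytes, optimizer state: {state_bytes} bytes")
```

Adam keeps two buffers per parameter, so its state is at least twice the size of the parameters. If this single step already fails with your optimizer, accumulating gradients over more steps will not help.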