Since .backward() accumulates gradients, can I call loss.backward() twice and then optim.step()?
Would it take double the GPU memory or not?
Would the effect be the same as using batch size * 2?
You could use a smaller batch size and accumulate the gradients, then update the parameters with your optimizer after a few iterations. Have a look at the 2nd option in this post.

This would yield the same behavior regarding the gradients, but note that other layers like BatchNorm will behave differently, since they see the smaller batches. If that's problematic, e.g. when your batch size is really small, you could tweak the momentum a bit or use other normalization layers such as GroupNorm, which should be more stable with respect to small batch sizes.
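A minimal sketch of this gradient accumulation pattern, assuming a toy model, dummy data, and an `accumulation_steps` variable that are not from the original posts: each micro-batch gets its own forward/backward pass, so peak memory stays at the micro-batch level while the accumulated gradient matches that of the larger effective batch (up to the loss scaling shown below).

```python
import torch
import torch.nn as nn

# Toy setup (hypothetical names, for illustration only)
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Dummy "dataloader": a list of small micro-batches
loader = [(torch.randn(4, 10), torch.randint(0, 2, (4,))) for _ in range(8)]

accumulation_steps = 2  # effective batch size = micro-batch size * accumulation_steps

optimizer.zero_grad()
for i, (data, target) in enumerate(loader):
    output = model(data)
    # Divide so the accumulated gradient matches the average over the larger batch
    loss = criterion(output, target) / accumulation_steps
    loss.backward()  # gradients are accumulated into param.grad

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()       # update with the accumulated gradients
        optimizer.zero_grad()  # clear for the next accumulation cycle
```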
For option #2 from this post, do you need the flag retain_graph=True, i.e. loss.backward(retain_graph=True)? It seems like this could be important.
I think the question has been answered here:
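As a short illustration of the distinction (a sketch, not taken from the linked thread): retain_graph=True is only needed when backward() is called more than once through the same computation graph. In the accumulation pattern above, each micro-batch builds a fresh graph, so a plain loss.backward() is enough.

```python
import torch

x = torch.randn(3, requires_grad=True)

# Case 1: separate forward passes per backward -> no retain_graph needed
for _ in range(2):
    loss = (x * 2).sum()   # a new graph is built each iteration
    loss.backward()        # fine without retain_graph=True

# Case 2: two backward calls through the same graph -> retain_graph needed
loss = (x * 2).sum()
loss.backward(retain_graph=True)  # keep the graph alive for the next call
loss.backward()                   # second pass through the same graph
```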