Hello,
I’m curious about the potential differences in memory usage and runtime between two setups:

1. torch.no_grad() for 10 forward passes (no gradients needed), followed by one normal forward-backward pass and an optimizer update.
2. Gradient accumulation over 10 forward passes (gradients needed), calling backward after each pass, followed by a single optimizer update using the accumulated gradients.

I think the runtime should be smaller with torch.no_grad(), since no gradient computation is done, but how does the memory cost compare between the two cases? A rough sketch of what I mean is below. Thanks.
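Here is a minimal sketch of the two cases, just to make the comparison concrete (the toy model, dummy data, and batch size are placeholders for illustration only):

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
batches = [(torch.randn(32, 128), torch.randint(0, 10, (32,))) for _ in range(10)]

# Case 1: 10 forward passes under torch.no_grad() (no autograd graph is built,
# so no activations are saved), then one ordinary forward-backward-step update.
with torch.no_grad():
    for x, y in batches:
        _ = model(x)                    # inference only
x, y = batches[0]
loss = criterion(model(x), y)           # this single pass does build a graph
loss.backward()
optimizer.step()
optimizer.zero_grad()

# Case 2: gradient accumulation over the same 10 batches, then one step.
for x, y in batches:
    loss = criterion(model(x), y) / len(batches)
    loss.backward()                     # each pass's graph is freed after its backward
optimizer.step()
optimizer.zero_grad()
```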