Memory cost and running time of @torch.no_grad()

Hello,

I’m curious about the potential differences in memory usage and runtime between two setups: (1) running 10 forward passes under torch.no_grad() (no gradients needed), followed by a normal forward-backward pass to update the model, versus (2) running 10 forward passes with gradient accumulation (gradients needed), followed by a single optimizer step that uses the accumulated gradients to update the model. I think the runtime should be shorter with torch.no_grad(), since no gradients are computed, but how does the memory cost compare between the two cases? A minimal sketch of what I mean is below. Thanks.
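
To make the comparison concrete, here is a rough sketch of the two setups I have in mind. The model, loss, optimizer, and batches are just placeholders, not the actual code I'm running:

```python
import torch

# Placeholder model / optimizer / data for illustration only.
model = torch.nn.Linear(128, 10)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
batches = [(torch.randn(32, 128), torch.randint(0, 10, (32,))) for _ in range(10)]

# Case 1: 10 forward passes under torch.no_grad() (no autograd graph is built),
# then one ordinary forward-backward update on a single batch.
with torch.no_grad():
    for x, y in batches:
        _ = model(x)               # inference only
x, y = batches[0]
optimizer.zero_grad()
loss = criterion(model(x), y)      # normal forward with autograd enabled
loss.backward()
optimizer.step()

# Case 2: gradient accumulation over the same 10 batches,
# then one optimizer step using the accumulated .grad buffers.
optimizer.zero_grad()
for x, y in batches:
    loss = criterion(model(x), y) / len(batches)  # scale so gradients average
    loss.backward()                # accumulates into .grad
optimizer.step()
```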