With my limited insight into PyTorch's autograd functionality, I was under the impression that the computational graph is built dynamically during the forward pass, and hence consumes a lot of memory, while this memory is gradually freed during the backward pass, where the actual gradients are calculated.
However, when I monitor memory usage during a forward and backward pass, I see that the forward pass consumes almost no memory, while the backward pass repeatedly allocates and frees memory in a cyclical manner. I am running on the CPU.
Could someone please explain to me why I experience this behaviour?
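For concreteness, here is a minimal sketch of the kind of monitoring I mean (the psutil-based measurement and the toy model are placeholders, not my exact code):

```python
import os

import psutil
import torch
import torch.nn as nn

proc = psutil.Process(os.getpid())

def rss_mb():
    # Resident set size of this process, in MB
    return proc.memory_info().rss / 1024**2

# Toy model and batch, just to have something to measure
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 10))
x = torch.randn(2048, 4096)

print(f"before forward: {rss_mb():.1f} MB")
out = model(x)
loss = out.sum()
print(f"after forward:  {rss_mb():.1f} MB")
loss.backward()
print(f"after backward: {rss_mb():.1f} MB")
```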
To save memory, the backward pass recomputes some parts. The forward pass doesn't store much, so you can use a big batch without fearing RAM/GPU-RAM issues. Then, during the backward pass you compute only what you need from the few values that were stored, and you keep only the gradients needed for the backpropagation algorithm.
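If you want to see this compute-for-memory trade-off explicitly, one mechanism PyTorch exposes is `torch.utils.checkpoint`. This is just an illustration with a toy block, not necessarily what happens inside your particular model:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))
x = torch.randn(512, 1024, requires_grad=True)

# Normal forward: the intermediate activations inside `block`
# are kept alive until backward needs them.
y = block(x).sum()
y.backward()

# Checkpointed forward: the activations inside `block` are dropped
# after the forward pass and recomputed during backward,
# trading extra compute for lower peak memory.
x.grad = None
y = checkpoint(block, x, use_reentrant=False).sum()  # use_reentrant=False is the recommended mode in recent PyTorch releases
y.backward()
```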
I hope that's clear; don't hesitate to ask for a more detailed explanation.