The model parameter and input size might be tiny compared to the stored intermediates needed for gradient computation.
Have a look at this post showing an example.
The model parameter and input size might be tiny compared to the stored intermediates needed for gradient computation.
Have a look at this post showing an example.