You are not accounting for the intermediate activations, which might use significantly more memory than the parameters and would also fit what you are seeing:
This post estimates the memory usage from parameters and forward activations.
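As a rough illustration of why activations can dominate, here is a minimal sketch for a plain fully connected network. The function name `estimate_mlp_memory`, the layer sizes, and the batch size are made up for the example, and it assumes float32 values (4 bytes each). It counts only the weights, biases, and one stored output per layer, so it ignores the input tensor, gradients, optimizer state, and allocator overhead.

```python
def estimate_mlp_memory(layer_sizes, batch_size, bytes_per_value=4):
    """Rough memory estimate in bytes for a fully connected network.

    layer_sizes is a hypothetical list such as [1024, 1024, 1024];
    bytes_per_value=4 assumes float32.
    Returns (parameter_bytes, forward_activation_bytes).
    """
    param_count = 0
    activation_count = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        # Weight matrix plus bias vector of this layer.
        param_count += fan_in * fan_out + fan_out
        # One output vector per sample, kept around for the backward pass.
        activation_count += batch_size * fan_out
    return param_count * bytes_per_value, activation_count * bytes_per_value


param_bytes, act_bytes = estimate_mlp_memory([1024, 1024, 1024], batch_size=4096)
print(f"parameters:  {param_bytes / 2**20:.1f} MiB")
print(f"activations: {act_bytes / 2**20:.1f} MiB")
```

Parameter memory is independent of the batch size, while activation memory grows linearly with it, so with a large batch the activations can easily outweigh the parameters.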