Depending on the model architecture, the parameters and buffers might use less memory than the forward activations that need to be stored for the gradient calculation. This post explains it with an example.
Yes, using more layers will create more intermediate activations, which need to be stored for the backward pass, assuming you want to train the model.
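
To get a feeling for the numbers, here is a rough back-of-the-envelope sketch in plain Python. It assumes a hypothetical stack of 3x3 conv layers with 64 channels, stride 1 and "same" padding, float32 tensors, and a batch of 32 images at 224x224. These values are made up for illustration and are not taken from your model. It also only counts one stored output per layer, so it ignores gradients, optimizer states, and any workspace memory the backend might use.

```python
BYTES_PER_FLOAT32 = 4


def conv_param_count(in_ch, out_ch, kernel):
    # weight tensor [out_ch, in_ch, kernel, kernel] plus one bias per output channel
    return out_ch * in_ch * kernel * kernel + out_ch


def conv_output_numel(batch, out_ch, height, width):
    # assumes stride 1 and "same" padding, so the spatial size is preserved
    return batch * out_ch * height * width


def estimate(num_layers, batch=32, channels=64, size=224, kernel=3):
    """Return (parameter bytes, activation bytes) for a hypothetical conv stack."""
    param_bytes = 0
    act_bytes = 0
    in_ch = 3  # assumed RGB input
    for _ in range(num_layers):
        param_bytes += conv_param_count(in_ch, channels, kernel) * BYTES_PER_FLOAT32
        act_bytes += conv_output_numel(batch, channels, size, size) * BYTES_PER_FLOAT32
        in_ch = channels
    return param_bytes, act_bytes


for n in (2, 4, 8):
    p, a = estimate(n)
    print(f"{n} layers: params {p / 2**20:.2f} MiB, activations {a / 2**20:.2f} MiB")
```

In this setup the activation memory grows linearly with the number of layers and is far larger than the parameter memory, since each layer's output scales with the batch size and the spatial resolution while the conv weights do not. The real usage of your model can differ, so measuring it directly (e.g. via `torch.cuda.max_memory_allocated()`) is the more reliable check.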