The parameters often account for only a small fraction of the overall memory footprint, depending on the model architecture.
E.g. conv layers usually have very few parameters (just the often small kernel and the bias), while their output activations can be huge (these are needed for the gradient computation and are stored if Autograd is enabled). This post describes it in more detail for a ResNet architecture and you can already see the effect in this small code snippet:
import torch
import torch.nn as nn

# input: 1 * 3 * 224 * 224 elements * 4 bytes (float32)
x = torch.randn(1, 3, 224, 224)
print(x.nelement() * x.element_size())
# 602112

conv = nn.Conv2d(3, 64, 3, 1, 1)
# weight (64 * 3 * 3 * 3) + bias (64) = 1792 elements
params = sum(p.nelement() for p in conv.parameters())
print(params * conv.weight.element_size())
# 7168

out = conv(x)
print(out.nelement() * out.element_size())
# 12845056

print(12845056 / 7168)
# 1792.0
As you can see, the output activation uses ~1800x more memory than the parameters of this conv layer.
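To see this effect per layer in a deeper model, you could register forward hooks and compare each layer's parameter memory against the size of its output activation. A minimal sketch (the small `nn.Sequential` model, the `acts` dict, and the hook helper are just illustrative choices, not from the original snippet):

```python
import torch
import torch.nn as nn

# store the output activation size (in bytes) per layer name
acts = {}

def make_hook(name):
    def hook(module, inp, out):
        acts[name] = out.nelement() * out.element_size()
    return hook

# toy model; in practice you would use your real architecture
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, 1, 1),
    nn.ReLU(),
    nn.Conv2d(64, 128, 3, 1, 1),
)
for name, m in model.named_modules():
    if isinstance(m, nn.Conv2d):
        m.register_forward_hook(make_hook(name))

x = torch.randn(1, 3, 224, 224)
out = model(x)

for name, m in model.named_modules():
    if isinstance(m, nn.Conv2d):
        param_bytes = sum(p.nelement() * p.element_size() for p in m.parameters())
        print(f"layer {name}: params={param_bytes} bytes, activation={acts[name]} bytes")
```

Note that this only measures the forward activations of the hooked layers; the actual training footprint also includes gradients, optimizer states, and intermediates stored by Autograd inside fused ops.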