Why is this simple model becoming so enormous during training?

Depending on the model architecture, the parameters might account for only a small fraction of the overall memory footprint.
E.g. conv layers usually have very few parameters (just the often small kernel and a bias), while the output activation can be huge (it is needed for the gradient computation and is stored as long as Autograd is enabled). This post describes it in more detail for a ResNet architecture, and you can already see the effect in this small code snippet:

import torch
import torch.nn as nn

# input tensor: 1 x 3 x 224 x 224 float32 values
x = torch.randn(1, 3, 224, 224)
print(x.nelement() * x.element_size())
# 602112

# the conv layer only stores a small kernel and a bias
conv = nn.Conv2d(3, 64, 3, 1, 1)
params = sum(p.nelement() for p in conv.parameters())
print(params * conv.weight.element_size())
# 7168

# the output activation is much larger than the parameters
out = conv(x)
print(out.nelement() * out.element_size())
# 12845056

print(12845056 / 7168)
# 1792.0

As you can see, the output activation uses ~1792x more memory than the parameters of this conv layer.
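
If you want to see how this adds up across a whole model, a rough way is to record the size of every intermediate activation with forward hooks and compare it to the parameter memory. The small stack of conv layers below is just a made-up example to illustrate the idea (it is not the model from your question):

import torch
import torch.nn as nn

# toy model: two conv layers with ReLUs, chosen only for illustration
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, 1, 1),
    nn.ReLU(),
    nn.Conv2d(64, 64, 3, 1, 1),
    nn.ReLU(),
)

act_bytes = []

def record_activation(module, inp, out):
    # record the size of each output activation in bytes
    act_bytes.append(out.nelement() * out.element_size())

hooks = [m.register_forward_hook(record_activation) for m in model]

x = torch.randn(1, 3, 224, 224)
out = model(x)

param_bytes = sum(p.nelement() * p.element_size() for p in model.parameters())
print(f"parameters : {param_bytes} bytes")
print(f"activations: {sum(act_bytes)} bytes")

for h in hooks:
    h.remove()

Keep in mind these activations are only kept around because they are needed for the backward pass; wrapping the forward pass in torch.no_grad() (e.g. during validation) avoids storing them and reduces the memory usage accordingly.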