How do I determine why my optimizer is taking up so much memory?

I am hitting a CUDA OOM at optimizer.step(): memory usage jumps from around 20 GiB to more than 40 GiB during this step. Although my model is large, I froze most of it, so the optimizer shouldn’t be adding this much memory. Does anyone know how I can debug this? I am freezing my model like this:

# Freeze every parameter of the inner model so no gradients are computed for them
for _, param in model.model.named_parameters():
    param.requires_grad = False
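
For reference, a quick sanity check along these lines (just a rough sketch, assuming model is the top-level module) shows what is still trainable after freezing:

# Sketch: list parameters that still require gradients after the freeze.
# Anything defined outside model.model would show up here.
still_trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(f"{len(still_trainable)} parameter tensors still require grad")
print(still_trainable[:10])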

Hi,
Which optimizer are you using? Adaptive optimizers like Adam need more VRAM than non-adaptive ones like plain SGD because they store additional internal state for each parameter they update.
Also, could you please confirm at which training step this spike occurs? In my experience it should happen at the very first step (first batch of the first epoch), after which the VRAM consumption should stay relatively stable.
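
To confirm where the jump happens, you could log allocated memory around the first optimizer.step() call, something like this (rough sketch, assuming a single default CUDA device):

import torch

# Log allocated CUDA memory right before and after the first optimizer.step().
torch.cuda.reset_peak_memory_stats()
before = torch.cuda.memory_allocated()
optimizer.step()
after = torch.cuda.memory_allocated()
peak = torch.cuda.max_memory_allocated()
print(f"before step: {before / 1024**3:.2f} GiB, "
      f"after step: {after / 1024**3:.2f} GiB, "
      f"peak: {peak / 1024**3:.2f} GiB")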

You’re correct that I am using Adam and that this spike occurs at the first iteration. However, the number of trainable parameters in my model is very small, so I don’t see why Adam would need 20 GB of VRAM for them.
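
As a back-of-the-envelope check (just a sketch, assuming FP32 and that Adam keeps two state tensors, exp_avg and exp_avg_sq, per trainable parameter), the expected Adam overhead would be roughly:

# Estimate the extra memory Adam's state should need for the trainable params.
trainable = sum(p.nelement() for p in model.parameters() if p.requires_grad)
adam_state_gib = trainable * 2 * 4 / 1024**3  # two FP32 state tensors per param
print(f"trainable params: {trainable:,}, expected Adam state: ~{adam_state_gib:.2f} GiB")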

I see.
Could you please run the following and share the output?

# Rough static-memory estimate, assuming FP32 (4 bytes per element).
param_size = sum(p.nelement() for p in model.parameters())

# Only the parameters handed to the optimizer get gradients and optimizer state.
trainable_size = sum(p.nelement() for group in optimizer.param_groups for p in group['params'])
optimizer_size = 2 * trainable_size  # Adam keeps exp_avg and exp_avg_sq per param
grads_size = trainable_size          # gradients exist only for trainable params
input_batch_size = ...  # fill in: no. of elements in the input batch, e.g. batch_size * 3 (channels) * 32 * 32 (image dimensions)

total_units = param_size + optimizer_size + grads_size + input_batch_size  # forward activations are missing here
total_vram = total_units * 4 / (1024**3)  # in GiB
print(total_vram)

This should give you an idea of how much memory each part of the pipeline takes up. Note that forward activations are not included here (you would need to register forward hooks to measure those), but it should give us a first approximation.
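
If you do want a rough number for the activations as well, something along these lines should work (a sketch, assuming model is your nn.Module and counting only plain tensor outputs; run a single forward pass in between):

import torch

activation_bytes = 0
handles = []

def count_output(module, inputs, output):
    # Accumulate the size of tensor outputs; tuple/dict outputs are ignored in this sketch.
    global activation_bytes
    if torch.is_tensor(output):
        activation_bytes += output.nelement() * output.element_size()

for module in model.modules():
    handles.append(module.register_forward_hook(count_output))

# run a single forward pass here, e.g. model(batch)

for h in handles:
    h.remove()
print(f"approx. activation memory: {activation_bytes / 1024**3:.2f} GiB")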