How can I figure out the GPU memory overhead of an optimizer?

I’m interested in figuring out how much GPU memory an optimizer like Adam adds compared to a no-overhead optimizer like plain SGD (without momentum).
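For context, my rough expectation (assuming the model is already defined and Adam keeps its usual two FP32 state tensors, `exp_avg` and `exp_avg_sq`, per parameter) is about 8 extra bytes per parameter, which I’d like to confirm by actually measuring it:

    # Back-of-the-envelope estimate, assuming Adam stores two FP32 state
    # tensors per parameter and plain SGD without momentum stores none.
    # (Ignores Adam's small per-parameter `step` counter.)
    n_params = sum(p.numel() for p in model.parameters())
    expected_adam_overhead_bytes = 2 * 4 * n_params  # two FP32 copies
    print(f"expected Adam overhead: {expected_adam_overhead_bytes / 2**20:.1f} MiB")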

My current plan is the following:

    print("START:")
    print(torch.cuda.memory_summary())

    with torch.cuda.amp.autocast(enabled=True):
        for _ in range(5):
            print("BEFORE BATCH:")
            print(torch.cuda.memory_summary())
            batch = torch.randn(2, 3, 224, 224).cuda()

            loss = model(batch)
            loss.backward()
            optimizer.zero_grad()
            loss_val.backward()
            optimizer.step()

            print("AFTER BATCH:")
            print(torch.cuda.memory_summary())
    print(torch.cuda.memory_summary())

However, I don’t like this plan because it’s not very precise: memory_summary() dumps every allocator statistic, and I’d have to eyeball the difference between runs with different optimizers. I think there should be a better way to measure the GPU memory overhead an optimizer adds.
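The closest alternative I’ve sketched (assuming all of the optimizer’s state ends up in `optimizer.state` on the GPU, and that Adam only allocates that state lazily on its first `step()`) is to diff `torch.cuda.memory_allocated()` around the first step, or to just sum the state tensors directly, but I’m not sure either is the right approach:

    # Sketch of a more direct measurement, under the assumptions above.
    torch.cuda.synchronize()
    before = torch.cuda.memory_allocated()
    optimizer.step()  # first step after a backward pass; Adam allocates its state here
    torch.cuda.synchronize()
    after = torch.cuda.memory_allocated()
    print(f"allocated delta around first step: {(after - before) / 2**20:.1f} MiB")

    # Or sum the optimizer's GPU state tensors directly.
    state_bytes = sum(
        t.numel() * t.element_size()
        for per_param_state in optimizer.state.values()
        for t in per_param_state.values()
        if torch.is_tensor(t) and t.is_cuda
    )
    print(f"optimizer.state tensors: {state_bytes / 2**20:.1f} MiB")

Is something along these lines reliable, or is there a more precise way to do this?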