Understanding GPU vs CPU memory usage

I’m quite new to productionizing PyTorch. We currently have a setup where I don’t necessarily have access to a GPU at inference time, but I want to make sure the model will have enough resources to run.

Based on the documentation I found, I have two main tools available: the profiler and torch.cuda.max_memory_allocated(). The latter is quite straightforward; apparently my model uses around 1GB of CUDA memory at inference.
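
By the latter I mean roughly this pattern (a simplified version of the handler I paste below):

import torch

# Reset the peak counter, run inference, then read the peak CUDA allocation
torch.cuda.reset_peak_memory_stats()
with torch.inference_mode():
    output = PRETRAINED_MODEL.inference(image=input_img, prompt=TASK_PROMPT)
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak CUDA memory: {peak_gb:.2f} GB")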

I’m more interested in the case when no GPU is available. For this I’m running the profiler, and the memory requirements seem massively increased. First of all, am I supposed to sum up the Self CPU Mem rows to get the required memory for inference, or is that memory freed up between layers? How would I go about calculating the actual memory requirements of my solution based on the profiler output? For context, we’re running dockerized images on Azure Kubernetes; my image serves a Flask application via gunicorn (I’m planning on moving to TorchServe later).

This is what my Flask app does, and how I’m checking the memory requirements currently:

import torch
from torch.profiler import profile, ProfilerActivity

@torch.inference_mode()
def donut_predict(input_img, task_prompt=TASK_PROMPT):
    global PRETRAINED_MODEL
    if torch.cuda.is_available():
        # Track the peak CUDA allocation and run the model in fp16 on the GPU
        torch.cuda.reset_peak_memory_stats(torch.device('cuda'))
        PRETRAINED_MODEL.half()
        PRETRAINED_MODEL.to(torch.device('cuda'))

    # Profile CPU time and memory around the inference call
    with profile(activities=[ProfilerActivity.CPU],
                 profile_memory=True, record_shapes=True) as prof:
        output = PRETRAINED_MODEL.inference(image=input_img, prompt=task_prompt)["predictions"][0]

    if torch.cuda.is_available():
        memory_used = torch.cuda.max_memory_allocated(torch.device('cuda')) / 1024**3  # Convert to GB
        app.logger.info(f"Max CUDA memory used for inference: {memory_used:.2f} GB")
    app.logger.info(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))

    return output

and the logger returns the following when I’m only using a CPU:

service-1  |                          Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem    # of Calls
service-1  | -----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
service-1  |                   aten::addmm        44.34%        2.575s        48.67%        2.827s     433.555us       3.80 Gb       3.79 Gb          6520
service-1  |                   aten::empty         0.20%      11.592ms         0.20%      11.592ms       1.167us       3.65 Gb       3.65 Gb          9929
service-1  |                     aten::add         4.70%     272.890ms         4.70%     272.890ms      70.170us       2.77 Gb       2.77 Gb          3889
service-1  |                     aten::bmm        11.07%     642.846ms        11.07%     642.862ms     197.439us       1.81 Gb       1.76 Gb          3256
service-1  |                    aten::gelu         4.26%     247.196ms         4.26%     247.196ms     299.995us       1.66 Gb       1.66 Gb           824
service-1  |                     aten::cat         2.20%     128.061ms         2.40%     139.552ms      75.679us       1.55 Gb       1.45 Gb          1844
service-1  |                aten::_softmax         3.29%     191.281ms         3.29%     191.281ms     117.494us       1.39 Gb       1.39 Gb          1628
service-1  |                     aten::mul         0.93%      53.730ms         1.17%      67.930ms      27.920us     428.84 Mb     428.78 Mb          2433
service-1  |                      aten::mm        14.84%     861.773ms        14.84%     861.774ms       4.224ms      98.90 Mb      96.70 Mb           204
service-1  |              aten::empty_like         0.04%       2.332ms         0.13%       7.669ms      13.893us       2.65 Gb      73.29 Mb           552
service-1  | -----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
service-1  | Self CPU time total: 5.808s

I’d be happy to get any guidance, thank you!

The actual memory usage will depend on your setup.
E.g. different GPU architectures and CUDA runtimes will vary in the CUDA context size. The actual size will also vary depending on whether CUDA’s lazy module loading is enabled or not. Starting with the PyTorch binaries shipping with CUDA >= 11.7 we’ve enabled it by default. This will create a small context at init time and lazily load the device kernel code into the context once a new kernel is called. If your workflow uses dynamic shapes, the context size could thus grow.
Also, depending on your model you might use cudnn.benchmark = True, which will profile the available kernels for your current use case and select the fastest one whose workspace fits into your device memory.
As you can see, the actual memory usage depends on a lot of factors in your setup. While a theoretical memory usage can be calculated based on the number of parameters and intermediate activations (this post gives you an example), you should add an expected overhead for the aforementioned points.
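
For the parameter part of that estimate, a minimal sketch could look like this (assuming a plain nn.Module; intermediate activations, the CUDA context, and allocator overhead are not captured here):

import torch

def param_memory_gb(model: torch.nn.Module) -> float:
    # Memory taken by parameters and buffers only; activations are
    # workload-dependent and need to be estimated separately.
    param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    buffer_bytes = sum(b.numel() * b.element_size() for b in model.buffers())
    return (param_bytes + buffer_bytes) / 1024**3

Calling .half() on the model roughly halves this number, since float32 parameters take 4 bytes each and float16 parameters take 2.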


Hi, thank you for the answer! I’m on CUDA 11.8 and this is just for inference in inference mode, and for that, 1GB of VRAM seems great. I’m more interested in the CPU-RAM side of things in case a CUDA device is not available. As mentioned in the post, I’m not sure how to interpret the profiler output. The values in the Self CPU Mem column add up to roughly 18GB, while the CUDA max-memory check returns 1GB when using CUDA. Is this reasonable? Does CPU inference actually use about 17x the memory CUDA uses, or am I not supposed to sum that column up? That seems like a very stark difference, and I was wondering if I’m understanding this correctly. On the other hand, if the CPU Mem values should not be summed up, that would make sense, because then the maximum value would be 3.79GB; CUDA inference is done in .half() mode, so 1GB vs 3.79GB seems more reasonable.

Most importantly, given this profiler output, do I need roughly 4GB or roughly 20GB of memory to host this model with CPU inference? Bottom line, that’s what I’m trying to figure out.
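
In the meantime, the only sanity check I can think of is measuring the process’s peak resident memory around the inference call instead of relying on the profiler table; a rough sketch of what I have in mind (Linux-only, since the units of ru_maxrss differ across platforms):

import resource

# Peak resident set size of the worker process so far; on Linux
# ru_maxrss is reported in kilobytes (on macOS it is in bytes).
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
app.logger.info(f"Peak RSS: {peak_kb / 1024**2:.2f} GB")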

How do I enable CUDA’s lazy module loading?
Thanks!

If you want to enable it for other applications, use export CUDA_MODULE_LOADING="LAZY".
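
Setting it from Python should also work, as long as it happens before anything in the process initializes CUDA; a minimal sketch:

import os

# Must be set before the CUDA context is created, i.e. before the first
# CUDA call in this process; otherwise it has no effect.
os.environ["CUDA_MODULE_LOADING"] = "LAZY"

import torch

torch.cuda.init()  # the context is now created with lazy module loading enabled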