I’m quite new to productionizing PyTorch. We currently have a setup where I don’t necessarily have access to a GPU at inference time, but I want to make sure the model will have enough resources to run.
Based on the documentation I found, I have two main tools available: the profiler and torch.cuda.max_memory_allocated(). The latter is quite straightforward; apparently my model uses around 1 GB of CUDA memory at inference.
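For reference, the GPU-side measurement boils down to something like this (a minimal sketch; measure_cuda_peak and run_inference are placeholder names I’m using here, not anything from my actual app):

import torch

def measure_cuda_peak(model, run_inference):
    # Sketch: peak CUDA memory allocated during one inference call.
    device = torch.device("cuda")
    torch.cuda.reset_peak_memory_stats(device)  # zero the peak counter first
    output = run_inference(model)               # placeholder for model.inference(...)
    torch.cuda.synchronize(device)              # wait for asynchronous kernels to finish
    peak_gb = torch.cuda.max_memory_allocated(device) / 1024**3
    return output, peak_gb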
I’m more interested in the case where no GPU is available. For that I’m running the profiler, and the memory requirements seem massively higher. First of all, am I supposed to sum up the Self CPU Mem rows to get the memory required for inference, or is that memory freed up between layers? How would I go about calculating the actual memory requirements of my solution from the profiler output? For context, we’re running dockerized images on Azure Kubernetes; my image serves a Flask application via gunicorn (I’m planning to move to TorchServe later).
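If summing is indeed the right way to read the table, I assume I could compute it straight from the profiler object, something like this (just a sketch; I’m going by the self_cpu_memory_usage field on the averaged events, which as far as I can tell is in bytes):

def summed_self_cpu_mem_gb(prof):
    # Sketch: add up the positive "Self CPU Mem" contributions across all ops.
    total_bytes = sum(
        evt.self_cpu_memory_usage
        for evt in prof.key_averages()
        if evt.self_cpu_memory_usage > 0  # skip ops that only release memory
    )
    return total_bytes / 1024**3  # bytes -> GB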
My Flask app does the following; this is how I’m currently checking the memory requirements:
import torch
from torch.profiler import profile, ProfilerActivity

# app, PRETRAINED_MODEL and TASK_PROMPT are defined elsewhere in the module.

@torch.inference_mode()
def donut_predict(input_img, task_prompt=TASK_PROMPT):
    global PRETRAINED_MODEL
    if torch.cuda.is_available():
        # Zero the peak counter so max_memory_allocated() reflects only this call,
        # then move the half-precision model to the GPU.
        torch.cuda.reset_peak_memory_stats(torch.device("cuda"))
        PRETRAINED_MODEL.half()
        device = torch.device("cuda")
        PRETRAINED_MODEL.to(device)
    # Profile CPU time and memory for the actual inference call.
    with profile(activities=[ProfilerActivity.CPU],
                 profile_memory=True, record_shapes=True) as prof:
        output = PRETRAINED_MODEL.inference(image=input_img, prompt=task_prompt)["predictions"][0]
    if torch.cuda.is_available():
        # Only read CUDA stats when a GPU is actually present.
        memory_used = torch.cuda.max_memory_allocated(torch.device("cuda")) / 1024**3  # bytes -> GB
        app.logger.info(f"Max CUDA memory used for inference: {memory_used:.2f} GB")
    app.logger.info(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))
    return output
and the logger returns the following when I’m only using a CPU:
Name                Self CPU %    Self CPU  CPU total %   CPU total  CPU time avg     CPU Mem  Self CPU Mem  # of Calls
------------------  ----------  ----------  -----------  ----------  ------------  ----------  ------------  ----------
aten::addmm             44.34%      2.575s       48.67%      2.827s     433.555us     3.80 Gb       3.79 Gb        6520
aten::empty              0.20%    11.592ms        0.20%    11.592ms       1.167us     3.65 Gb       3.65 Gb        9929
aten::add                4.70%   272.890ms        4.70%   272.890ms      70.170us     2.77 Gb       2.77 Gb        3889
aten::bmm               11.07%   642.846ms       11.07%   642.862ms     197.439us     1.81 Gb       1.76 Gb        3256
aten::gelu               4.26%   247.196ms        4.26%   247.196ms     299.995us     1.66 Gb       1.66 Gb         824
aten::cat                2.20%   128.061ms        2.40%   139.552ms      75.679us     1.55 Gb       1.45 Gb        1844
aten::_softmax           3.29%   191.281ms        3.29%   191.281ms     117.494us     1.39 Gb       1.39 Gb        1628
aten::mul                0.93%    53.730ms        1.17%    67.930ms      27.920us   428.84 Mb     428.78 Mb        2433
aten::mm                14.84%   861.773ms       14.84%   861.774ms       4.224ms    98.90 Mb      96.70 Mb         204
aten::empty_like         0.04%     2.332ms        0.13%     7.669ms      13.893us     2.65 Gb      73.29 Mb         552
------------------  ----------  ----------  -----------  ----------  ------------  ----------  ------------  ----------
Self CPU time total: 5.808s
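As a separate sanity check, I was also considering just recording the peak RSS of the gunicorn worker process inside the container, roughly like this (resource is from the standard library; on Linux ru_maxrss is reported in kilobytes):

import resource

def peak_rss_gb():
    # Sketch: peak resident set size of this process since startup (Linux reports KB).
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return peak_kb / 1024**2  # KB -> GB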
I’d appreciate any guidance, thank you!