CUDA Memory Profiling: peculiar memory values

Hi, there!

I am running the following code:

import logging
import torch
from torch.profiler import profile, record_function, ProfilerActivity

logging.basicConfig(level=logging.INFO)

# profile_memory=True is what populates the memory columns shown below
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], profile_memory=True) as prof:
    with record_function("model_inference"):
        model(inputs)
logging.info(prof.key_averages().table(sort_by="cuda_time_total", row_limit=2))

and the results are as follows:
INFO:root:-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                            aten::empty         0.58%       7.258ms         0.62%       7.673ms       4.270us       0.000us         0.00%       0.000us       0.000us       1.66 Kb       1.66 Kb      77.63 Gb      77.63 Gb          1797
                                    aten::empty_strided         1.86%      23.133ms         2.98%      37.010ms      16.493us       0.000us         0.00%       1.000us       0.000us           0 b           0 b      24.62 Gb      24.62 Gb          2244
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------

Self CPU time total: 1.242s
Self CUDA time total: 982.062ms

However, my total GPU memory is only 49 GB. My question is: what does it mean that the “CUDA Mem” values above are 77.63 GB and 24.62 GB?


It says ‘Gb’ with a lowercase ‘b’; perhaps it means gigabits?


As far as I know, memory is measured in bytes, not bits, unless one is talking about data transfer rates, which are measured in bits.


From the docs:

PyTorch profiler can also show the amount of memory (used by the model’s tensors) that was allocated (or released) during the execution of the model’s operators. In the output below, ‘self’ memory corresponds to the memory allocated (released) by the operator, excluding the children calls to the other operators. To enable memory profiling functionality pass profile_memory=True.

import torch
from torch.profiler import profile, record_function, ProfilerActivity

x = torch.randn(1024, 1024, 1024, device="cuda")
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], profile_memory=True, record_shapes=True) as prof:
    with record_function("model_inference"):
        for _ in range(1000):
            y = x.relu()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
# -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
#                                                    Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls  
# -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
#                                         model_inference         0.00%       0.000us         0.00%       0.000us       0.000us       10.611s        50.01%       10.611s       10.611s           0 b           0 b           0 b           0 b             1  
#                                         model_inference         0.03%       2.733ms        49.14%        5.219s        5.219s       0.000us         0.00%       10.609s       10.609s           0 b           0 b       4.00 Gb   -3996.00 Gb             1  
#                                              aten::relu         0.03%       3.307ms        49.11%        5.216s       5.216ms       0.000us         0.00%       10.609s      10.609ms           0 b           0 b    4000.00 Gb           0 b          1000  
#                                         aten::clamp_min         0.06%       6.282ms        49.08%        5.212s       5.212ms       10.609s        49.99%       10.609s      10.609ms           0 b           0 b    4000.00 Gb    4000.00 Gb          1000  
# void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us       10.609s        49.99%       10.609s       5.304ms           0 b           0 b           0 b           0 b          2000  
#                                        cudaLaunchKernel        49.02%        5.206s        49.02%        5.206s       2.603ms       0.000us         0.00%       0.000us       0.000us           0 b           0 b           0 b           0 b          2000  
#                                   cudaStreamIsCapturing         0.00%       0.631us         0.00%       0.631us       0.631us       0.000us         0.00%       0.000us       0.000us           0 b           0 b           0 b           0 b             1  
#                                              cudaMalloc         0.00%     302.388us         0.00%     302.388us     302.388us       0.000us         0.00%       0.000us       0.000us           0 b           0 b           0 b           0 b             1  
#                                   cudaDeviceSynchronize        50.86%        5.401s        50.86%        5.401s        5.401s       0.000us         0.00%       0.000us       0.000us           0 b           0 b           0 b           0 b             1  
# -------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
# Self CPU time total: 10.620s
# Self CUDA time total: 21.220s

As you can see here, the intermediate allocations/releases are tracked, not the peak memory usage.
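To make the numbers concrete, here is a rough back-of-the-envelope check of the output above (illustrative arithmetic only; the ~8 GiB peak is an assumption about how many tensors stay alive at once):

# Each x.relu() call above allocates a fresh 1024**3 float32 output, i.e. 4 GiB,
# and the profiler sums these per-operator allocations across all 1000 calls.
per_call_bytes = 1024**3 * 4                # one relu output: 4 GiB
reported_total = per_call_bytes * 1000      # ~4000 GiB, matching the 'CUDA Mem' column
peak_estimate = per_call_bytes * 2          # x plus roughly one live y at a time (assumption)

print(f"profiler 'CUDA Mem' (cumulative): {reported_total / 1024**3:.0f} GiB")
print(f"estimated peak tensors in memory: {peak_estimate / 1024**3:.0f} GiB")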

Thank you! I now realize that it may not be possible to get the peak memory usage.

torch.cuda.max_memory_allocated() and torch.cuda.max_memory_reserved() will return the peak values for the allocated and cached memory.
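For example, a minimal sketch reusing the toy workload from above (resetting the peak stats first is optional but makes the numbers easier to interpret):

import torch

torch.cuda.reset_peak_memory_stats()              # start peak tracking from a clean slate

x = torch.randn(1024, 1024, 1024, device="cuda")  # same 4 GiB toy tensor as above
for _ in range(1000):
    y = x.relu()
torch.cuda.synchronize()

# Peak memory held by live tensors vs. what the caching allocator reserved from the driver.
print(f"max allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")
print(f"max reserved:  {torch.cuda.max_memory_reserved() / 1024**3:.2f} GiB")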


I successfully traced GPU memory usage as you described. Is there a way to get the same for CPU memory? I sometimes use the CPU instead of the GPU.
Otherwise, how bad would it be to assume that CPU memory usage matches GPU memory usage when training on the CPU only?
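In case it helps, a minimal sketch of CPU-only memory profiling (the model and inputs here are placeholders, not your actual setup): with profile_memory=True, the profiler table gains “CPU Mem” / “Self CPU Mem” columns even when only ProfilerActivity.CPU is used.

import torch
from torch.profiler import profile, record_function, ProfilerActivity

model = torch.nn.Linear(512, 512)   # placeholder model (assumption)
inputs = torch.randn(8, 512)

# CPU-only run: profile_memory=True tracks host-side allocations made by ATen operators.
with profile(activities=[ProfilerActivity.CPU], profile_memory=True) as prof:
    with record_function("model_inference"):
        model(inputs)

print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))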