Hi there!
I am running the following code:
import logging
from torch.profiler import profile, record_function, ProfilerActivity

logging.basicConfig(level=logging.INFO)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             profile_memory=True) as prof:
    with record_function("model_inference"):
        model(inputs)

logging.info(prof.key_averages().table(sort_by="cuda_time_total", row_limit=2))
and the results are as follows:
INFO:root:-------------------  ----------  --------  -----------  ---------  ------------  ---------  -----------  ----------  -------------  -------  ------------  --------  -------------  ----------
Name                 Self CPU %  Self CPU  CPU total %  CPU total  CPU time avg  Self CUDA  Self CUDA %  CUDA total  CUDA time avg  CPU Mem  Self CPU Mem  CUDA Mem  Self CUDA Mem  # of Calls
-------------------  ----------  --------  -----------  ---------  ------------  ---------  -----------  ----------  -------------  -------  ------------  --------  -------------  ----------
aten::empty          0.58%       7.258ms   0.62%        7.673ms    4.270us       0.000us    0.00%        0.000us     0.000us        1.66 Kb  1.66 Kb       77.63 Gb  77.63 Gb       1797
aten::empty_strided  1.86%       23.133ms  2.98%        37.010ms   16.493us      0.000us    0.00%        1.000us     0.000us        0 b      0 b           24.62 Gb  24.62 Gb       2244
-------------------  ----------  --------  -----------  ---------  ------------  ---------  -----------  ----------  -------------  -------  ------------  --------  -------------  ----------
Self CPU time total: 1.242s
Self CUDA time total: 982.062ms
However, my total GPU memory is 49 GB. My question is: what does it mean that the "CUDA Mem" values above are 77.63 Gb and 24.62 Gb?
It says "Gb" with a lower-case "b"; perhaps it means gigabits?
As far as I know, memory is measured in bytes, not bits, unless one is talking about data transfer rates, which are measured in bits.
From the docs:
PyTorch profiler can also show the amount of memory (used by the model's tensors) that was allocated (or released) during the execution of the model's operators. In the output below, "self" memory corresponds to the memory allocated (released) by the operator, excluding the children calls to the other operators. To enable memory profiling functionality pass profile_memory=True.
import torch
from torch.profiler import profile, record_function, ProfilerActivity

x = torch.randn(1024, 1024, 1024, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             profile_memory=True, record_shapes=True) as prof:
    with record_function("model_inference"):
        for _ in range(1000):
            y = x.relu()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
# ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
# Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg CPU Mem Self CPU Mem CUDA Mem Self CUDA Mem # of Calls
# ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
# model_inference 0.00% 0.000us 0.00% 0.000us 0.000us 10.611s 50.01% 10.611s 10.611s 0 b 0 b 0 b 0 b 1
# model_inference 0.03% 2.733ms 49.14% 5.219s 5.219s 0.000us 0.00% 10.609s 10.609s 0 b 0 b 4.00 Gb -3996.00 Gb 1
# aten::relu 0.03% 3.307ms 49.11% 5.216s 5.216ms 0.000us 0.00% 10.609s 10.609ms 0 b 0 b 4000.00 Gb 0 b 1000
# aten::clamp_min 0.06% 6.282ms 49.08% 5.212s 5.212ms 10.609s 49.99% 10.609s 10.609ms 0 b 0 b 4000.00 Gb 4000.00 Gb 1000
# void at::native::vectorized_elementwise_kernel<4, at... 0.00% 0.000us 0.00% 0.000us 0.000us 10.609s 49.99% 10.609s 5.304ms 0 b 0 b 0 b 0 b 2000
# cudaLaunchKernel 49.02% 5.206s 49.02% 5.206s 2.603ms 0.000us 0.00% 0.000us 0.000us 0 b 0 b 0 b 0 b 2000
# cudaStreamIsCapturing 0.00% 0.631us 0.00% 0.631us 0.631us 0.000us 0.00% 0.000us 0.000us 0 b 0 b 0 b 0 b 1
# cudaMalloc 0.00% 302.388us 0.00% 302.388us 302.388us 0.000us 0.00% 0.000us 0.000us 0 b 0 b 0 b 0 b 1
# cudaDeviceSynchronize 50.86% 5.401s 50.86% 5.401s 5.401s 0.000us 0.00% 0.000us 0.000us 0 b 0 b 0 b 0 b 1
# ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
# Self CPU time total: 10.620s
# Self CUDA time total: 21.220s
As you can see here, the intermediate allocations/releases are tracked, not the peak memory usage.
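To make that difference concrete, here is a minimal sketch (a hypothetical toy loop, not code from this thread) that contrasts the cumulative "Self CUDA Mem" column with the peak reported by torch.cuda.max_memory_allocated(); the tensor size and loop count are arbitrary:

import torch
from torch.profiler import profile, ProfilerActivity

# Repeatedly allocate and drop a ~1 GiB tensor. The freed block is reused by
# the caching allocator, so the peak stays near 1 GiB, while the profiler's
# "Self CUDA Mem" column sums every allocation made by aten::empty.
torch.cuda.reset_peak_memory_stats()

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             profile_memory=True) as prof:
    for _ in range(100):
        t = torch.empty(1024, 1024, 256, device="cuda")  # ~1 GiB of float32
        del t  # returned to the caching allocator and reused next iteration

print(prof.key_averages().table(sort_by="self_cuda_memory_usage", row_limit=5))
# aten::empty should report on the order of 100 Gb of cumulative CUDA Mem,
# even though the peak usage stayed around 1 GiB:
print(torch.cuda.max_memory_allocated() / 1024**3, "GiB peak allocated")

That is why a cumulative CUDA Mem figure (like the 77.63 Gb above) can exceed the physical GPU memory: the same memory is allocated and released many times.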
Thank you! I now realize that it may not be possible to get the peak memory usage from the profiler output.
torch.cuda.max_memory_allocated() and torch.cuda.max_memory_reserved() will return the peak values for the allocated and cached (reserved) memory, respectively.
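A minimal usage sketch (the model and input here are placeholder stand-ins for the original model(inputs) call, not anything from this thread):

import torch
import torch.nn as nn

# Hypothetical stand-ins for the poster's model and inputs.
model = nn.Linear(4096, 4096).cuda()
inputs = torch.randn(256, 4096, device="cuda")

torch.cuda.reset_peak_memory_stats()   # reset the recorded peaks to current usage
with torch.no_grad():
    model(inputs)
torch.cuda.synchronize()

print(f"peak allocated: {torch.cuda.max_memory_allocated() / 1024**2:.1f} MiB")
print(f"peak reserved:  {torch.cuda.max_memory_reserved() / 1024**2:.1f} MiB")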
I successfully traced GPU memory usage as you described here. Is there a way to get the same for CPU memory? I sometimes use the CPU instead of the GPU.
Otherwise, how bad would it be to assume that CPU memory usage is the same as GPU memory usage when training on the CPU only?