How to measure peak CUDA memory usage?

Is there any way to measure peak CUDA memory usage without hurting execution time?
I know there are multiple functions kindly provided in torch.cuda, but they don’t include the CUDA overhead memory, i.e. there is always a gap between torch.cuda.max_memory_reserved() and what I see in nvidia-smi.
An alternative would be to start a thread that repeatedly calls nvidia-smi and tracks the maximum value. But the extra thread would affect the execution time of the main process, which is the other metric I want to benchmark.
Is there any solution for measuring both peak GPU memory usage and total execution time?
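As a rough sketch of the background-sampling idea, here is a minimal sampler that polls a query function from a daemon thread and records the maximum. The `query_gpu_used_mib` helper and the `PeakSampler` class are my own illustrative names, not torch.cuda APIs; the sampler works with any query function, so it can also be pointed at a faster in-process query if nvidia-smi is too slow.

```python
import subprocess
import threading
import time

def query_gpu_used_mib(index=0):
    """Current GPU memory usage in MiB as reported by nvidia-smi.
    Assumes nvidia-smi is on PATH; `index` selects the GPU."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits", "-i", str(index)])
    return int(out.decode().strip())

class PeakSampler:
    """Poll `query_fn` from a background thread and remember the maximum."""
    def __init__(self, query_fn, interval=0.1):
        self.query_fn = query_fn
        self.interval = interval
        self.peak = 0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        while not self._stop.is_set():
            self.peak = max(self.peak, self.query_fn())
            self._stop.wait(self.interval)  # sleep, but wake early on stop

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
        # one final sample after the workload finishes
        self.peak = max(self.peak, self.query_fn())
```

Usage would look like `with PeakSampler(query_gpu_used_mib) as s: run_workload()`, then read `s.peak`. Note the caveat in the question still applies: the polling thread and the subprocess launches do add some overhead to the main process.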
Thanks very much.

torch.cuda.max_memory_reserved() (or whichever similar function) shows the peak, not the real memory usage. Memory is reused on demand: when the allocator no longer needs a block, it’s marked as available but not actually “freed”, so that memory slot can be overwritten later.

It’s like hard disks: when you delete something you are not physically erasing it, just marking that space as available.
That’s why the numbers from nvidia-smi and torch.cuda don’t match.
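The caching behaviour described above can be sketched with a toy model (plain Python, purely illustrative — this is not PyTorch’s actual allocator): freed blocks go back into a pool for reuse instead of being returned to the device, so “reserved” never shrinks even as “allocated” does.

```python
class ToyCachingAllocator:
    """Toy model of a caching allocator: freed blocks are kept in a
    pool for reuse rather than returned to the device."""
    def __init__(self):
        self.allocated = 0   # bytes held by live tensors
        self.reserved = 0    # bytes held from the device (what nvidia-smi-style tools see)
        self.free_pool = 0   # reserved but currently unused

    def malloc(self, size):
        if self.free_pool >= size:
            self.free_pool -= size   # reuse a cached block
        else:
            self.reserved += size    # grab more memory from the device
        self.allocated += size

    def free(self, size):
        self.allocated -= size
        self.free_pool += size       # cache it; don't give it back
```

Allocating 100 bytes, freeing them, then allocating 50 leaves `allocated == 50` but `reserved == 100` — the device-level number never drops, which is the gap the question observes (on top of the fixed CUDA context overhead).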


Thanks @JuanFMontesinos.
But my problem still isn’t solved.
Does anyone have a solution?
Thanks again.

@b02202050 Did you get a solution for this?
Performance aside, simply running a background thread to query nvidia-smi doesn’t seem feasible: how do you set the query frequency to make sure you actually capture the maximum memory usage?
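One practical bound on that choice: the polling interval can’t usefully be shorter than the time one query takes, and any memory spike briefer than the interval may be missed entirely. A small helper (my own sketch, not a library function) can measure the per-call latency of whatever query you use — launching nvidia-smi as a subprocess is typically far slower than an in-process query, which is one reason a constant-overhead correction (measure the CUDA context cost once, then rely on the torch.cuda peak counters) may be a more reliable strategy than polling.

```python
import time

def query_latency(query_fn, n=20):
    """Average wall time of one call to `query_fn`.
    The polling interval cannot usefully be shorter than this."""
    t0 = time.perf_counter()
    for _ in range(n):
        query_fn()
    return (time.perf_counter() - t0) / n
```

For example, `query_latency(lambda: subprocess.check_output(["nvidia-smi"]))` tells you the fastest rate at which an nvidia-smi-based sampler could possibly run.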