Understanding Memory Profiler output (Autograd Profiler with Memory stats)

Can somebody help me understand the following output log, generated using the autograd profiler with memory profiling enabled? My specific questions are the following:

  • What’s the difference between CUDA Mem and Self CUDA Mem?

  • Why are some of the memory stats negative (and how should I reason about them)?

  • How do I compute the total memory utilization (the total averages displayed at the bottom)?

Thanks in advance!

                             Name          CUDA Mem  Self CUDA Mem     
                      aten::empty          3.06 Gb       3.06 Gb      
                    aten::random_              0 b           0 b      
          aten::is_floating_point              0 b           0 b      
                       aten::item              0 b           0 b      
                   aten::randperm              0 b           0 b      
                      aten::randn          6.50 Kb           0 b      
                    aten::randint              0 b           0 b      
                     aten::select              0 b           0 b      
                        aten::mul          3.00 Kb           0 b      
                       aten::set_              0 b           0 b      
                       aten::view              0 b           0 b      
                    aten::permute              0 b           0 b      
                 aten::contiguous              0 b           0 b      
                        aten::div              0 b           0 b      
                      aten::stack              0 b           0 b      
                      aten::zeros              0 b           0 b      
              Copy data to device          1.50 Mb           0 b      
                       Forward D0        625.91 Mb      -2.00 Kb      
       aten::binary_cross_entropy          1.50 Kb      -1.50 Kb      
                      Backward D0          1.00 Kb      -1.00 Kb      
                     MulBackward0          1.50 Kb           0 b      
       BinaryCrossEntropyBackward          1.50 Kb           0 b      
                 SqueezeBackward1              0 b           0 b      
                     ViewBackward              0 b           0 b      
                  SigmoidBackward          1.50 Kb           0 b      
         CudnnConvolutionBackward          1.62 Gb           0 b      
   torch::autograd::CopyBackwards          2.45 Gb           0 b      
  torch::autograd::AccumulateGrad        846.79 Mb           0 b      
               LeakyReluBackward1        409.50 Mb           0 b      
           CudnnBatchNormBackward        225.16 Mb           0 b      
                       Forward G0         22.58 Mb    -770.00 Kb      
                       Forward D1        625.16 Mb      -2.50 Kb      
                      Backward D1              0 b      -1.00 Kb      
                      Optimizer D            512 b      -2.50 Kb      
                       Forward D2        625.16 Mb      -2.50 Kb      
                       Backward G              0 b      -1.00 Kb      
                     TanhBackward        768.00 Kb           0 b      
CudnnConvolutionTransposeBackward         14.32 Mb           0 b      
                    ReluBackward1          7.50 Mb           0 b      
                      Optimizer G            512 b      -2.50 Kb      
---------------------------------  -- ------------  ------------  ----
Self CPU time total: 11.786s
CUDA time total: 12.148s

<FunctionEventAvg key=Total self_cpu_time=11.786s cpu_time=9.369ms  self_cuda_time=12.148s cuda_time=10.153ms input_shapes=[[1]] cpu_memory_usage=20845792 cuda_memory_usage=28655543808>
  • "Self" (for a resource such as time or memory on a specific device) is the amount of that resource spent in the routine itself, excluding what is spent in the functions it calls. For example,
-------------  ------------  ------------  ------------  ------------
         Name      Self CPU     CPU total       CPU Mem  Self CPU Mem
-------------  ------------  ------------  ------------  ------------
     aten::mm        1.883s        1.883s       9.81 Gb           0 b

Here, aten::mm spends all of its time within the function itself and none in calling other functions (Self CPU = CPU total). However, Self CPU Mem is 0 b, meaning it does not allocate any new memory apart from what is given to it as arguments (9.81 Gb).

  • Negative memory (mostly found in the Self columns) indicates deallocation. As far as I understand, it is the extra memory used by that function; the negative sign indicates that the memory was allocated and then deallocated by the time the function terminates. In your case, Forward G0 allocates an extra 770 Kb, which is then deallocated before it terminates.
  • It is better to identify the maximum memory consumed by a function rather than the total memory, since the peak is what forms the bottleneck.
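
On reading the totals line at the bottom of the log: cpu_memory_usage and cuda_memory_usage are printed as raw byte counts. Below is a minimal sketch for converting them to the units used in the table. `format_bytes` is a hypothetical helper, not a profiler API, and it assumes the table uses binary multiples (1 Kb = 1024 b), which is inferred from values such as 1.50 Kb.

```python
def format_bytes(nbytes):
    """Render a byte count in table-style units (assumes 1 Kb = 1024 b)."""
    KB = 1024
    MB = 1024 * KB
    GB = 1024 * MB
    size = abs(nbytes)  # pick the unit by magnitude so negative values work too
    if size >= GB:
        return f"{nbytes / GB:.2f} Gb"
    if size >= MB:
        return f"{nbytes / MB:.2f} Mb"
    if size >= KB:
        return f"{nbytes / KB:.2f} Kb"
    return f"{nbytes} b"


# Values copied from the FunctionEventAvg line in the log above.
print(format_bytes(28655543808))  # cuda_memory_usage
print(format_bytes(20845792))     # cpu_memory_usage
```

Under that assumption the two totals come out at roughly 26.69 Gb of CUDA memory and 19.88 Mb of CPU memory.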

Can someone please verify this? @Anand_Krish, where did you see it in the documentation? It is a bit counterintuitive, isn’t it? If a process allocates X and deallocates X, I would assume it would say 0. I don’t understand why it would say -X.


def f():
    x: Tensor = g()

Here the allocation happens inside g(), so it is not counted in the “self mem” of f(). The deallocation, however, can be attributed to f() or to other outer scopes.
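
To make that attribution concrete, here is a toy model in plain Python. It is not how the profiler is implemented: `ScopeLedger` and its methods are hypothetical names, and the rule that every allocation or free is charged to the innermost open scope is an assumption made for illustration. The 770 Kb figure is borrowed from the Forward G0 row.

```python
from collections import defaultdict


class ScopeLedger:
    """Toy model: each allocation or free is charged to the innermost open scope."""

    def __init__(self):
        self.stack = []                    # names of the currently open scopes
        self.self_mem = defaultdict(int)   # bytes charged directly to a scope
        self.total_mem = defaultdict(int)  # bytes charged to a scope or anything it calls

    def enter(self, name):
        self.stack.append(name)

    def exit(self):
        self.stack.pop()

    def record(self, nbytes):
        """Positive nbytes is an allocation, negative nbytes is a deallocation."""
        self.self_mem[self.stack[-1]] += nbytes
        for name in self.stack:
            self.total_mem[name] += nbytes


ledger = ScopeLedger()
ledger.enter("f")
ledger.enter("g")
ledger.record(+770 * 1024)  # g() allocates the tensor it returns
ledger.exit()               # g returns; the tensor is still alive inside f
ledger.record(-770 * 1024)  # x is released while f is the innermost scope
ledger.exit()

print(dict(ledger.self_mem))
print(dict(ledger.total_mem))
```

In this model g ends up with a positive self value and f with a matching negative one, while the total for f nets out to zero.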

That’s a great example. Thanks.
So if I wish to know the ‘peak’ memory consumption of a function, can I interpret the total mem as the peak? Or might allocations and deallocations shift that value?

Yeah, I think these accumulators don’t consider the order of operations, so for big code blocks they may not always reflect the peak value.
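
A small illustration of why a net accumulator can miss the peak, using made-up numbers (`net_and_peak` is a hypothetical helper and the event sizes are invented):

```python
def net_and_peak(events):
    """Return (net change, highest running total) for a sequence of byte deltas."""
    net = 0
    peak = 0
    for delta in events:
        net += delta           # what an order-blind accumulator ends up with
        peak = max(peak, net)  # what was actually live at the worst moment
    return net, peak


# A 600-byte temporary is allocated and freed, then 100 bytes are kept.
print(net_and_peak([+600, -600, +100]))  # (100, 600)
```

The net figure says 100 bytes, but 600 bytes had to fit in memory at one point.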

Perhaps the following will do it (haven’t tested it):
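import torch
torch.cuda.reset_peak_memory_stats()  # forget peaks reached earlier in the run
m0 = torch.cuda.memory_allocated()    # baseline; run the code to measure after this line
peak = torch.cuda.max_memory_allocated() - m0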