Illogical (or unintuitive) PyTorch Profiler output

The profiler sometimes produces hard-to-comprehend, even (seemingly) illogical, output. Consider the following example:

import torch
import torch.nn as nn

from torch.profiler import ProfilerActivity, profile, record_function

# two small 3D conv layers: in_channels, out_channels, kernel_size, stride, padding
layer1 = nn.Conv3d(1, 32, 3, 1, 1)
layer2 = nn.Conv3d(32, 32, 3, 1, 1)

# input of shape (N, C, D, H, W)
x = torch.randn(1, 1, 256, 256, 1)

with profile(activities=[ProfilerActivity.CPU], record_shapes=True, profile_memory=True) as prof:
    with record_function("model run"):
        with torch.autocast(device_type="cpu", dtype=torch.bfloat16, enabled=True):
            res = layer1(x)
            res = layer2(res)

print(prof.key_averages().table(sort_by="cpu_time_total"))

On my machine, this results in:

-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem    # of Calls  
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                 aten::conv3d         0.00%      64.000us       199.98%       25.887s        6.472s      16.18 Mb           0 b             4  
                    model run         0.01%     901.000us       100.00%       12.945s       12.945s       8.18 Mb        -128 b             1  
            aten::convolution         0.00%      97.000us        99.99%       12.943s        6.472s       8.00 Mb           0 b             2  
           aten::_convolution         0.00%      52.000us        99.99%       12.943s        6.472s       8.00 Mb           0 b             2  
            aten::slow_conv3d         0.00%      26.000us        99.99%       12.943s        6.472s       8.00 Mb           0 b             2  
    aten::slow_conv3d_forward        99.98%       12.942s        99.98%       12.943s        6.472s       8.00 Mb    -111.38 Mb             2  
                  aten::copy_         0.01%       1.518ms         0.01%       1.518ms     216.857us           0 b           0 b             7  
                     aten::to         0.00%      16.000us         0.01%     808.000us     161.600us     183.81 Kb           0 b             5  
               aten::_to_copy         0.00%      80.000us         0.01%     792.000us     158.400us     183.81 Kb           0 b             5  
                aten::resize_         0.00%      43.000us         0.00%      43.000us      21.500us       8.00 Mb       8.00 Mb             2  
                aten::reshape         0.00%      33.000us         0.00%      42.000us      21.000us           0 b           0 b             2  
          aten::empty_strided         0.00%      40.000us         0.00%      40.000us       8.000us     183.81 Kb     183.81 Kb             5  
                  aten::empty         0.00%      32.000us         0.00%      32.000us       8.000us     111.38 Mb     111.38 Mb             4  
                   aten::view         0.00%      16.000us         0.00%      16.000us       8.000us           0 b           0 b             2  
         aten::_reshape_alias         0.00%       9.000us         0.00%       9.000us       4.500us           0 b           0 b             2  
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 12.945s

This raises several questions:

  1. How can a single operation account for 199.98% of CPU time? Is this a bug or a feature? (A minimal sketch of how a row can exceed 100% follows after this list.)

  2. Generally, shouldn’t “model run”, being the top-level execution context, always account for all of the CPU time? After all, every other operation can be considered one of its child calls. In this case it actually does sit at 100% (ignoring the aten::conv3d issue), but I have also seen cases where it was less.

  3. The memory statistics are even harder to understand, and they are not well described on the doc and recipe pages. As I understand it, they should give the net amount of memory allocated (respectively freed) within the operation itself (Self CPU Mem) and within the operation plus its child calls (CPU Mem) over the whole profiling run. If so, however, shouldn’t “model run” again have the highest value, since it encompasses all other operations? (A toy check of that reading is sketched after this list.)

  4. While net allocation statistics may have their uses, a more useful metric would arguably be peak allocation. For CUDA devices such functions do exist, e.g. torch.cuda.max_memory_allocated, but they are unfortunately missing for the CPU backend. (A rough process-level workaround is sketched below.)
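
Regarding question 1, here is a minimal sketch of how a single row can end up above 100%: when events with the same name are nested inside one another, key_averages appears to sum their overlapping totals (possibly what happens to aten::conv3d under autocast here, given the 4 reported calls for only 2 Python-level invocations):

import torch
from torch.profiler import ProfilerActivity, profile, record_function

x = torch.randn(1024, 1024)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    with record_function("step"):      # outer scope
        with record_function("step"):  # nested scope with the *same* name
            y = x @ x

# key_averages() groups events by name, so the two nested "step" events are
# merged; the outer total already contains the inner one, and the merged row
# shows roughly 200% CPU total with "# of Calls" = 2.
print(prof.key_averages().table(sort_by="cpu_time_total"))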
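
For question 3, a toy case for checking my reading of the two memory columns; I would expect the allocation to show up under CPU Mem for both the enclosing scope and aten::empty, but under Self CPU Mem only for aten::empty, since the scope itself allocates nothing:

import torch
from torch.profiler import ProfilerActivity, profile, record_function

with profile(activities=[ProfilerActivity.CPU], profile_memory=True) as prof:
    with record_function("outer"):
        t = torch.empty(1024, 1024)  # ~4 MiB float32, allocated inside aten::empty

# Expected: aten::empty reports ~4 Mb in both CPU Mem and Self CPU Mem,
# while "outer" reports it only under CPU Mem (it comes from a child call).
print(prof.key_averages().table(sort_by="cpu_time_total"))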
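
As for question 4, one rough process-level workaround would be to read the resident-set-size high-water mark (Unix only, and it covers the whole process rather than just tensor allocations):

import resource  # Unix only

import torch
import torch.nn as nn

layer = nn.Conv3d(1, 32, 3, 1, 1)
x = torch.randn(1, 1, 64, 64, 64)

# ru_maxrss is the high-water mark of the process's resident set size
# (reported in KiB on Linux, in bytes on macOS), so only growth of the
# peak is visible, and it is not limited to PyTorch tensors.
rss_before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
y = layer(x)
rss_after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS grew by ~{rss_after - rss_before} KiB")

# The CUDA-side equivalent that already exists:
#   torch.cuda.reset_peak_memory_stats(); ...; torch.cuda.max_memory_allocated()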

On a side note: considering the data sizes in this example, the units should probably say MiB rather than Mb, even though the difference is not huge.
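
For instance, the 8.00 value on the convolution rows lines up exactly with mebibytes if the two conv outputs are assumed to be stored in bfloat16 (2 bytes per element) under autocast:

# two conv outputs of shape (1, 32, 256, 256, 1), bfloat16
num_bytes = 2 * (32 * 256 * 256 * 1) * 2
print(num_bytes)          # 8388608
print(num_bytes / 2**20)  # 8.0   -> exactly 8 MiB
print(num_bytes / 10**6)  # ~8.39 -> what the row would read if the unit were really MB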