Behaviour of torch.autograd.profiler.profile with multiple iterations

Hi,

I'm getting some strange behaviour when trying to profile a simple linear layer with the following code.

import torch

class linear_layer(torch.nn.Module):

   def __init__(self, in_size, out_size, dropout):

       super(linear_layer, self).__init__()

       layer = torch.nn.ModuleList()
        
       # Build layer
       lin = torch.nn.Linear(in_size, out_size)
       layer.append(lin)

       # Add dropout
       layer.append(torch.nn.Dropout(dropout))

       self.layer = torch.nn.Sequential(*layer)

   def forward(self, x):

       f = self.layer(x)
       
       return f
   

# Create model and profile
model = linear_layer(1000, 1000, dropout=0.1)
model.cuda()

a = torch.ones([1000, 1000]).float().cuda()

with torch.autograd.profiler.profile(use_cuda=True) as prof:

   for i in range(110):
   
       pred = model(a)

t = prof.table(sort_by='cpu_time_total')

print(t)

For the first ~100 iterations (possibly more, depending on hardware), the printed output looks like this, with the CPU time for the addmm op at around 30us.


Name                       CPU time        CUDA time            Calls        CPU total       CUDA total
dropout                    30.845us        110.596us                1         30.845us        110.596us
addmm                      30.881us       1362.946us                1         30.881us       1362.946us
addmm                      30.929us       1358.841us                1         30.929us       1358.841us
addmm                      31.018us       1359.863us                1         31.018us       1359.863us
dropout                    31.265us        111.618us                1         31.265us        111.618us

However, in the final few iterations (the change happens consistently around iteration 100 on my laptop), the CPU time for the same op jumps to around 1300us.


Name                       CPU time        CUDA time            Calls        CPU total       CUDA total
addmm                    1358.354us       1414.139us                1       1358.354us       1414.139us
addmm                    1359.222us       1501.190us                1       1359.222us       1501.190us
addmm                    1364.973us       1505.280us                1       1364.973us       1505.280us
addmm                    1416.549us       1509.369us                1       1416.549us       1509.369us
addmm                    1435.247us       1357.834us                1       1435.247us       1357.834us
addmm                    1436.379us       1353.729us                1       1436.379us       1353.729us
addmm                    1481.470us       1426.422us                1       1481.470us       1426.422us
addmm                    1506.124us       1504.257us                1       1506.124us       1504.257us
addmm                    1513.627us       1357.819us                1       1513.627us       1357.819us
_fused_dropout           1594.706us        551.936us                1       1594.706us        551.936us
dropout                  1612.827us        553.984us                1       1612.827us        553.984us
addmm                    1950.068us       3250.112us                1       1950.068us       3250.112us

I have no idea why this is happening - any ideas?
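One check I'm planning to do (just a sketch, outside the profiler, reusing the linear_layer class from above) is timing each forward call on the host with time.perf_counter, to see whether the call itself starts taking over a millisecond around iteration 100, or whether the jump only shows up in the profiler table:

import time
import torch

# Same model and input as above
model = linear_layer(1000, 1000, dropout=0.1).cuda()
a = torch.ones([1000, 1000]).float().cuda()

host_times = []
for i in range(110):
    start = time.perf_counter()
    pred = model(a)  # time only the host-side call that launches the work
    host_times.append(time.perf_counter() - start)

# Host-side time per iteration, in microseconds
for i, t in enumerate(host_times):
    print(i, round(t * 1e6, 1))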

It seems the problem isn't anything specific to the ops in the example above, but rather the number of iterations. I get the same effect with the following code:

a = torch.ones([10000, 10000]).cuda()
    
with torch.autograd.profiler.profile(use_cuda=True) as pr:
    for i in range(400):
        b = a * 10.0

t = pr.table(sort_by='cpu_time_total')
print(t)  

This is the output table at around rows 120 to 130. It shows the same sudden jump in CPU time, which then continues to climb slowly until it's significantly greater than the CUDA time.


Name            CPU time        CUDA time            Calls        CPU total       CUDA total
mul             28.688us       8523.766us                1         28.688us       8523.766us
mul             29.667us       8509.441us                1         29.667us       8509.441us
mul             77.528us       8491.265us                1         77.528us       8491.265us
mul            726.486us       8133.500us                1        726.486us       8133.500us
mul           8383.802us       8525.750us                1       8383.802us       8525.750us
mul           8387.009us       8410.250us                1       8387.009us       8410.250us
mul           8401.606us       8426.500us                1       8401.606us       8426.500us
mul           8404.111us       8478.750us                1       8404.111us       8478.750us

I'm not sure whether the op is actually being executed on the CPU, or whether this is an anomaly in the profiler…
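One guess I want to test (purely an assumption on my part) is that once the CUDA work queue fills up, the kernel launches themselves start blocking on the CPU, and the profiler records that wait as CPU time. Adding an explicit sync after each iteration should show whether that's what's going on, since the launch then never has to wait on a full queue:

import torch

a = torch.ones([10000, 10000]).cuda()

with torch.autograd.profiler.profile(use_cuda=True) as pr:
    for i in range(400):
        b = a * 10.0
        # Wait for the kernel to finish before the next launch,
        # so the launch itself never blocks on a saturated queue
        torch.cuda.synchronize()

print(pr.table(sort_by='cpu_time_total'))

If the per-op CPU times stay around 30us with the sync in place, that would suggest the op isn't really running on the CPU and the jump is just launch overhead being attributed to it.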