No speedup with autocast?

Hey,

So, I am trying to speed up my code using FP16 computation with autocast.

I wrote two small code examples (with and without autocast).

with:

from torch.cuda.amp import autocast  # run in a cell before the %%time cell

%%time
for batch, (X, y) in enumerate(dltr):
    with autocast():
        pred = model(X)
print('Memory Allocated: %0.2fMB, Memory Reserved: %0.2fMB\n'
      % (torch.cuda.memory_allocated() / 1e6, torch.cuda.memory_reserved() / 1e6))

and without:

%%time
for batch, (X, y) in enumerate(dltr):
    pred = model(X)
print('Memory Allocated: %0.2fMB, Memory Reserved: %0.2fMB\n'
      % (torch.cuda.memory_allocated() / 1e6, torch.cuda.memory_reserved() / 1e6))

Comparing the performance of these two in terms of runtime and memory yields the following outputs:

with autocast:

Memory Allocated: 389.42MB, Memory Reserved: 2466.25MB 
Wall time: 26.8 s

without:

Memory Allocated: 689.19MB, Memory Reserved: 2466.25MB 
Wall time: 26.9 s

The input shape of X is (1024, 5, 5, 512).

As expected, using autocast yields (very) roughly a halving of the allocated memory, but no speedup. I also tested this using multiple timing loops (%%timeit); in all test runs, both versions take approximately the same time. Am I doing something wrong here, or is it just that autocast does not always yield a considerable speedup?
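One caveat with my measurement that I should mention: CUDA kernels launch asynchronously, so %%time can in principle miss work that is still queued on the GPU. A minimal sketch of an event-based timing (reusing my model and dltr from above) would look like this:

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
for batch, (X, y) in enumerate(dltr):
    with autocast():
        pred = model(X)
end.record()
torch.cuda.synchronize()  # wait until all queued GPU work has finished
print('Elapsed: %0.1f ms' % start.elapsed_time(end))

The explicit synchronization makes sure the measured time actually includes all the GPU work, not just the time to enqueue it.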

Given these results, it seems the only thing I can expect is to run my model with twice the batch size as before, while training takes the same time.

My model contains a few dozen convolutions (with kernel size 2 and mostly up to 8 channels). The total number of parameters is about 2000. Maybe the model is just very small, and a speedup only becomes significant for larger kernel dimensions or channel numbers?
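To test that hypothesis in isolation, a minimal sketch (the 4096×4096 shapes are arbitrary, just large enough to keep the GPU busy) benchmarking a single big matmul with and without autocast:

import time
import torch
from torch.cuda.amp import autocast

x = torch.randn(4096, 4096, device='cuda')
w = torch.randn(4096, 4096, device='cuda')

def bench(use_amp):
    for _ in range(5):  # warm-up
        with autocast(enabled=use_amp):
            x @ w
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(50):
        with autocast(enabled=use_amp):
            x @ w
    torch.cuda.synchronize()
    return time.perf_counter() - t0

print('fp32: %.3fs, amp: %.3fs' % (bench(False), bench(True)))

If autocast shows a clear win here but not in my model, the model is probably just too small for the reduced precision to matter.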

Thanks! Best, JZ

So, I just discovered the PyTorch profiler and ran it:

from torch.profiler import profile, record_function, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True) as prof:
    with record_function("model"):
        model(X, w=None, r=None)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

This gives me:

-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                         Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                        model         1.55%      11.558ms        61.78%     461.551ms     461.551ms      20.000ms         4.44%     450.704ms     450.704ms             1  
                 aten::conv2d         0.14%       1.023ms         1.87%      13.993ms     269.096us     615.000us         0.14%      80.720ms       1.552ms            52  
           aten::_convolution         0.18%       1.338ms         1.63%      12.208ms     234.769us       1.672ms         0.37%      79.628ms       1.531ms            52  
            aten::convolution         0.10%     733.000us         1.66%      12.366ms     247.320us     452.000us         0.10%      79.513ms       1.590ms            50  
              aten::clamp_min         0.33%       2.444ms         0.85%       6.356ms      63.560us      34.714ms         7.70%      69.452ms     694.520us           100  
                    aten::add         0.52%       3.903ms         0.52%       3.903ms      31.224us      58.502ms        12.98%      58.502ms     468.016us           125  
                    aten::var         0.35%       2.631ms         0.65%       4.876ms      97.520us      52.102ms        11.56%      53.047ms       1.061ms            50  
      aten::cudnn_convolution         0.73%       5.438ms         1.03%       7.684ms     147.769us      43.358ms         9.62%      44.504ms     855.846us            52  
             aten::avg_pool2d         0.11%     842.000us         0.11%     842.000us      30.071us      36.522ms         8.10%      36.522ms       1.304ms            28  
                   aten::relu         0.14%       1.022ms         0.66%       4.957ms      99.140us     658.000us         0.15%      35.963ms     719.260us            50  
-----------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 747.141ms
Self CUDA time total: 450.726ms

It seems the model spends a considerable amount of time on the CPU, which is probably why autocast does not deliver a speedup. However, I don't understand why time is spent on the CPU at all, since all the operations in forward are tensor operations on the GPU. Does anybody know typical reasons for this?
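My current guess is per-kernel launch overhead: every one of the several hundred aten calls in the table above has to be dispatched from the CPU, and for tiny tensors that dispatch can cost more than the GPU work itself. A minimal sketch to get a feeling for this (tensor size and iteration count are arbitrary):

import time
import torch

a = torch.randn(1000, device='cuda')

torch.cuda.synchronize()
t0 = time.perf_counter()
for _ in range(1000):
    a = a + 1  # 1000 tiny kernels; cost is dominated by CPU-side launch overhead
torch.cuda.synchronize()
print('1000 small kernels: %.1f ms' % ((time.perf_counter() - t0) * 1e3))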

Thanks!

Hi @jayz

What GPU are you running this on? Speedups from mixed precision are most evident on GPUs with Tensor Cores (Volta, Turing, Ampere).
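A quick way to check (a minimal sketch; note the caveat that the GTX 16-series reports Turing's compute capability 7.5 but ships without Tensor Cores):

import torch

# Tensor Cores arrived with compute capability 7.0 (Volta).
# Caveat: GTX 16-series cards report 7.5 (Turing) but have no Tensor Cores.
major, minor = torch.cuda.get_device_capability()
print(torch.cuda.get_device_name(), '- compute capability %d.%d' % (major, minor))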


Hey Suraj,

this is just a notebook graphics card:
NVIDIA GeForce GTX 1650 Ti
It does not have Tensor Cores, which is probably the answer.

Thanks! 🙂

Best, JZ