Slow convolutions on CPU with autocast

Context

After observing slower training with autocast (spotted via logging), I ran a profiler to check for expensive operations.

The results below are from a CPU, but on T4 GPUs in Colab bfloat16 is also very slow (much like float16 on the CPU below), while float16 does run faster than float32 there. (Apparently the reason is that T4s have no native bfloat16 support. I'm unsure if this is correct, but it seems to be the case in Colab.)

But the issue here is mainly for CPUs.
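One quick sanity check for the T4 theory (this snippet is mine, not from the original run) is to ask PyTorch directly whether the hardware reports native bfloat16 support:

```python
import torch

# On CUDA, PyTorch can report whether the device supports bfloat16 natively;
# a T4 (compute capability 7.5) should report False here.
if torch.cuda.is_available():
    print("CUDA bf16 supported:", torch.cuda.is_bf16_supported())

# On CPU, bf16 autocast goes through oneDNN (MKL-DNN) kernels; without
# AVX512-BF16/AMX hardware the bf16 math path can be much slower than fp32.
print("oneDNN available:", torch.backends.mkldnn.is_available())
```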

  • Hardware: CPU - 11th Gen Intel(R) Core™ i7-1165G7 @ 2.80GHz

Results

Running the profiler on inference (with the model in training mode) of an autoencoder, using a snippet like:

    import torch
    from torch.profiler import profile, record_function, ProfilerActivity

    with (
      profile(
        activities=[
          ProfilerActivity.CPU,
        ],
        record_shapes=True,
        profile_memory=True,
      ) as prof,
      # get_device_name() is a small helper returning "cpu" or "cuda"
      torch.autocast(get_device_name(), torch.float16),
      record_function("model_inference"),
    ):
      model(img)
    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=3))
  • Using 500 images and a network with ~10 MB of weights
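The slowdown can also be reproduced without the full autoencoder. Below is a minimal sketch (shapes are made up, not the actual model) that times a bare `conv2d` in each autocast dtype:

```python
import time
from contextlib import nullcontext

import torch
import torch.nn.functional as F

# Made-up shapes, just to exercise the conv kernels.
x = torch.randn(8, 3, 128, 128)
w = torch.randn(16, 3, 3, 3)

def bench(dtype, iters=10):
    # float32 runs plainly; half-precision types run under CPU autocast.
    ctx = nullcontext() if dtype is torch.float32 else torch.autocast("cpu", dtype=dtype)
    with ctx, torch.no_grad():
        F.conv2d(x, w)  # warm-up
        t0 = time.perf_counter()
        for _ in range(iters):
            F.conv2d(x, w)
        return (time.perf_counter() - t0) / iters

for dt in (torch.float32, torch.bfloat16, torch.float16):
    try:
        print(dt, f"{bench(dt) * 1e3:.2f} ms")
    except Exception as e:
        # float16 CPU autocast needs a fairly recent PyTorch build
        print(dt, "unsupported:", e)
```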

For each case it outputs the top operations, sorted by total CPU time:

float16

DEBUG:root:Device cpu 
INFO:root:--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                            Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem    # of Calls  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                    aten::linear         0.00%      49.990us       159.06%        2.392s     597.946ms      46.11 Mb           0 b             4  
                 model_inference         0.16%       2.472ms       100.00%        1.504s        1.504s       7.44 Mb    -250.91 Mb             1  
                     aten::addmm        79.36%        1.193s        79.49%        1.195s     597.630ms      19.34 Mb      19.34 Mb             2  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.504s

float32

DEBUG:root:Device cpu
INFO:root:--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                            Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem    # of Calls  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                 model_inference         4.88%       8.560ms       100.00%     175.375ms     175.375ms           0 b    -495.96 Mb             1  
               aten::convolution         0.06%     107.845us        52.11%      91.393ms      15.232ms     107.55 Mb           0 b             6  
              aten::_convolution        32.91%      57.722ms        52.05%      91.285ms      15.214ms     107.55 Mb           0 b             6  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 175.375ms

bfloat16

DEBUG:root:Device cpu
INFO:root:--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                            Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg       CPU Mem  Self CPU Mem    # of Calls  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
          aten::conv_transpose2d         0.02%      58.300us       119.44%     453.313ms      75.552ms      37.46 Mb           0 b             6  
                 model_inference         0.90%       3.428ms       100.00%     379.540ms     379.540ms       7.44 Mb    -250.91 Mb             1  
               aten::convolution         0.03%     123.037us        71.21%     270.288ms      45.048ms      53.78 Mb           0 b             6  
--------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 379.540ms

Both the convolutions and the transposed convolutions should be faster in half precision, yet they are slower than float32. Any ideas what is going wrong?

This was run inside a Docker container, but the results on a standard system install are almost the same.

I just remembered that I had commented on a related post and GitHub issue here: Autocast on cpu dramatically slow - #3 by Laurent1

No solution yet apparently. I’ll test what happens on T4s…
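In the meantime, one possible workaround (a sketch, not tested against the autoencoder above) is to locally disable autocast around the convolution layers so they keep running in float32 while the rest of the model stays in lower precision:

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
x = torch.randn(1, 3, 32, 32)

with torch.autocast("cpu", dtype=torch.bfloat16):
    # Other autocast-eligible ops in the model still run in bf16 here...
    with torch.autocast("cpu", enabled=False):
        # ...but inside this block autocast is off, so the conv
        # executes in plain float32.
        y = conv(x.float())

print(y.dtype)  # torch.float32
```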

Also, the issue may not be autocast itself but bf16; related issue below: