Context
After observing slower training with autocast (noticed via logging), a profiler was run to check for expensive operations.
The results below are for CPU, but on T4 GPUs in Colab bfloat16 also takes very long (much like float16 does on the CPU below), while float16 does run faster than float32 there. (Apparently the reason is that T4s do not natively support bfloat16; I'm not sure this is correct, but it seems to be the case in Colab.)
The issue reported here, however, is mainly about CPU.
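As a side check on the GPU claim (not something done in the original run), one can query the card's bfloat16 support directly; a T4 is compute capability 7.5, and native bfloat16 arrived with Ampere (8.0):

import torch

if torch.cuda.is_available():
    # A T4 reports (7, 5); native bfloat16 needs compute capability >= 8.0 (Ampere)
    print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))
    print("bf16 supported:", torch.cuda.is_bf16_supported())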
- Hardware: CPU - 11th Gen Intel(R) Core™ i7-1165G7 @ 2.80GHz
Results
Running the profiler on inference (with the model in training mode) of an autoencoder, using a snippet like:
import torch
from torch.profiler import profile, record_function, ProfilerActivity

with (
    profile(
        activities=[ProfilerActivity.CPU],
        record_shapes=True,
        profile_memory=True,
    ) as prof,
    # get_device_name() is a local helper that returns the device type ("cpu" here)
    torch.autocast(get_device_name(), torch.float16),
    record_function("model_inference"),
):
    model(img)
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=3))
- Setup: 500 images run through a network with about 10 MB of weights
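As an extra sanity check (not part of the original run), the dtype that autocast actually dispatches on CPU can be confirmed by inspecting the output tensor inside the autocast region; model and img are the same objects as in the snippet above:

with torch.autocast("cpu", torch.float16):
    out = model(img)
# prints torch.float16 if the last op ran under the reduced dtype,
# torch.float32 if autocast kept it in full precision
print(out.dtype)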
For each dtype, the profiler prints the following, sorted by total CPU time:
float16
DEBUG:root:Device cpu
INFO:root:-------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg CPU Mem Self CPU Mem # of Calls
-------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::linear 0.00% 49.990us 159.06% 2.392s 597.946ms 46.11 Mb 0 b 4
model_inference 0.16% 2.472ms 100.00% 1.504s 1.504s 7.44 Mb -250.91 Mb 1
aten::addmm 79.36% 1.193s 79.49% 1.195s 597.630ms 19.34 Mb 19.34 Mb 2
-------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 1.504s
float32
DEBUG:root:Device cpu
INFO:root:-------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg CPU Mem Self CPU Mem # of Calls
-------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
model_inference 4.88% 8.560ms 100.00% 175.375ms 175.375ms 0 b -495.96 Mb 1
aten::convolution 0.06% 107.845us 52.11% 91.393ms 15.232ms 107.55 Mb 0 b 6
aten::_convolution 32.91% 57.722ms 52.05% 91.285ms 15.214ms 107.55 Mb 0 b 6
-------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 175.375ms
bfloat16
DEBUG:root:Device cpu
INFO:root:-------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg CPU Mem Self CPU Mem # of Calls
-------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::conv_transpose2d 0.02% 58.300us 119.44% 453.313ms 75.552ms 37.46 Mb 0 b 6
model_inference 0.90% 3.428ms 100.00% 379.540ms 379.540ms 7.44 Mb -250.91 Mb 1
aten::convolution 0.03% 123.037us 71.21% 270.288ms 45.048ms 53.78 Mb 0 b 6
-------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 379.540ms
Both the convolutions and the transposed convolutions should be faster in half precision, but they are slower than float32. Any ideas what is going wrong?
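Not part of the original measurements, but one way to separate autocast overhead from the kernel cost itself is to time a small conv/deconv stack cast directly to each dtype (the layer sizes below are made up; the try/except is there because some ops may not have CPU kernels for every dtype):

import copy
import time
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.ReLU(),
    nn.ConvTranspose2d(16, 3, 3, padding=1),
)
x = torch.randn(8, 3, 128, 128)

for dtype in (torch.float32, torch.bfloat16, torch.float16):
    try:
        m = copy.deepcopy(net).to(dtype)
        xd = x.to(dtype)
        with torch.no_grad():
            m(xd)  # warm-up
            t0 = time.perf_counter()
            for _ in range(10):
                m(xd)
        print(dtype, f"{(time.perf_counter() - t0) / 10 * 1e3:.1f} ms/iter")
    except RuntimeError as err:
        print(dtype, "failed:", err)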