Float16 not faster than float32 with torch.matmul

Below is example code and the benchmarking results. I am constantly dealing with tensors of the sizes listed in the example code, and I find that torch.float16 is barely faster than torch.float32 for batched matmul. Is torch.float16 cast to torch.float32 in the intermediate steps? How do I speed up matmul for torch.float16?

I am running on an A2 GPU, with torch version '2.2.2+cu121'.

import torch
import torch.utils.benchmark as benchmark

torch.backends.cuda.matmul.allow_tf32 = False  # keep float32 matmul in full fp32 (no TF32)

# batched matmul: (b, n, p) @ (b, p, m) -> (b, n, m)
n = 1024
p = 8
m = 2048
b = 32
device = "cuda"
for dtype in [torch.float16, torch.float32]:
    a1 = torch.rand(b, n, p, device=device, dtype=dtype)
    a2 = torch.rand(b, p, m, device=device, dtype=dtype)

    t0 = benchmark.Timer(
            stmt='matmul(a1, a2)',
            setup='from torch import matmul',
            globals={'a1': a1, 'a2': a2})

    total = t0.timeit(100)
    print('\ndtype:', dtype)
    print(total)

Benchmarking results:

dtype: torch.float16
<torch.utils.benchmark.utils.common.Measurement object at 0x7f8233925060>
matmul(a1, a2)
setup: from torch import matmul
  1.61 ms
  1 measurement, 100 runs , 1 thread

dtype: torch.float32
<torch.utils.benchmark.utils.common.Measurement object at 0x7f8233924df0>
matmul(a1, a2)
setup: from torch import matmul
  1.64 ms
  1 measurement, 100 runs , 1 thread

Changing to torch.backends.cuda.matmul.allow_tf32 = True gives similar benchmarking results.
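
For reference, here is a minimal check of the output dtype (a sketch separate from the benchmark above): the result of a float16 matmul stays in float16, so the output itself is not upcast to float32. Whether the cuBLAS kernel accumulates internally in float32 is not visible from Python.

import torch

a1 = torch.rand(32, 1024, 8, device="cuda", dtype=torch.float16)
a2 = torch.rand(32, 8, 2048, device="cuda", dtype=torch.float16)
out = torch.matmul(a1, a2)
print(out.dtype)  # torch.float16 -- the output is not upcast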

It seems cuBLAS is not able to pick a faster kernel, resulting in almost the same performance for both workloads. I do see a clear speedup for float16 vs. float32 on an A100, so I will check whether it's a heuristics issue on the A2.
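
One way to see which kernels cuBLAS dispatches for each dtype is to profile the matmul with torch.profiler. The sketch below assumes the same shapes as in the benchmark above; the reported kernel names will vary with the GPU and cuBLAS version.

import torch
from torch.profiler import profile, ProfilerActivity

b, n, p, m = 32, 1024, 8, 2048
for dtype in [torch.float16, torch.float32]:
    a1 = torch.rand(b, n, p, device="cuda", dtype=dtype)
    a2 = torch.rand(b, p, m, device="cuda", dtype=dtype)
    torch.matmul(a1, a2)          # warm-up so kernel selection/launch overhead is excluded
    torch.cuda.synchronize()
    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        torch.matmul(a1, a2)
        torch.cuda.synchronize()
    print(dtype)
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=5))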

What kind of model are these shapes used in?

Hi,

Thanks for the reply. I am not training models; rather, I am using PyTorch to do signal processing with linear algebra techniques involving SVDs.

How much speedup did you see on the A100? Is float16 generally expected to be twice as fast as float32?