Is torch.autocast using mixed precision for matmul?

Dear all,

I am trying to find out what happens when we do matmuls (for example with nn.Linear) inside a block wrapped in torch.autocast(device_type="cuda", dtype=torch.float16).

For example, let's say I have:

import torch
import torch.nn as nn

x = torch.randn(4, 4).to("cuda")
m = nn.Linear(4, 16).to("cuda")   # move the layer to the GPU so it matches the input

with torch.autocast(device_type="cuda", dtype=torch.float16):
    y = m(x)
    print(y.dtype)  # prints torch.float16

What I would like to confirm is that the multiplications happen in fp16 while the accumulation is done in fp32. At least that's what I expect from the docs and from Train With Mixed Precision - NVIDIA Docs.
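
To make this concrete, here is a rough numerical check I have in mind (the shapes, the chunk size, and the chunked loop are just my own way of simulating an fp16-accumulated matmul, not how any kernel actually works):

import torch

torch.manual_seed(0)
# large inner dimension so accumulation error becomes visible
a = torch.randn(64, 4096, device="cuda")
b = torch.randn(4096, 64, device="cuda")

a16, b16 = a.half(), b.half()
ref = a16.float() @ b16.float()  # fp16 inputs, but all math in fp32: the "ideal" mixed-precision result

with torch.autocast(device_type="cuda", dtype=torch.float16):
    out_autocast = a @ b  # autocast casts the inputs to fp16; the question is how it accumulates

# crude simulation of fp16 accumulation: the running sum across chunks is kept in fp16
acc = torch.zeros(64, 64, device="cuda", dtype=torch.float16)
for k in range(0, 4096, 256):
    acc += a16[:, k:k+256] @ b16[k:k+256, :]

print("autocast vs fp32-accumulated reference:", (out_autocast.float() - ref).abs().max().item())
print("fp16-accumulated vs same reference:", (acc.float() - ref).abs().max().item())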

However, when I run code like the one above, I can't find any convincing evidence that the accumulation happens in fp32. The output dtype printed above is fp16, and when I profile similar code with nsys I see the kernel ampere_fp16_s1688gemm_fp16_128x128_ldg8_f2f_stages_32x1_tn, which, at least by name, doesn't mention fp32 (and the following operations take fp16 as input, with no conversion in between).
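
For completeness, the kernel names can also be listed from Python with torch.profiler. This is just a sketch of how one could capture them for the toy example above; it doesn't by itself say anything about the accumulation precision:

import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

x = torch.randn(4, 4).to("cuda")
m = nn.Linear(4, 16).to("cuda")

with profile(activities=[ProfilerActivity.CUDA]) as prof:
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        y = m(x)
    torch.cuda.synchronize()  # make sure the kernel has actually run inside the profiled region

# the table lists the launched CUDA kernels by name (e.g. the ampere_fp16_... gemm)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))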

Maybe ampere_fp16_s1688gemm_fp16_128x128_ldg8_f2f_stages_32x1_tn internally accumulates in fp32 and then converts the result to fp16? How can I debug this further? I'd like to avoid a matmul that accumulates the sums in fp16, because I expect that to have numerical stability issues, and it's not what is advertised.
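
One thing I am not sure is relevant here: PyTorch also exposes torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction, which is documented as controlling whether fp16 GEMMs may use reduced-precision reductions. A minimal sketch of toggling it and comparing outputs (I don't know whether it changes the kernel that cuBLAS picks in this case):

import torch

a = torch.randn(64, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 64, device="cuda", dtype=torch.float16)
ref = a.float() @ b.float()  # fp32 math as the baseline

torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False
full = a @ b      # reductions should stay in full precision

torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = True
reduced = a @ b   # reduced-precision reductions are allowed (not necessarily used)

print((full.float() - ref).abs().max().item())
print((reduced.float() - ref).abs().max().item())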

I am doing this profiling on NVIDIA L4 Tensor Core GPUs (Amazon g6 instances), and I am using torch version 2.6.0+cu124.