Dear all,
I am trying to find out what exactly happens when we do matmuls (for example with nn.Linear) inside a block wrapped in torch.autocast(device_type="cuda", dtype=torch.float16). For example, let's say I have:
import torch
import torch.nn as nn

x = torch.randn(4, 4).to("cuda")
m = nn.Linear(4, 16).to("cuda")
with torch.autocast(device_type="cuda", dtype=torch.float16):
    y = m(x)
print(y.dtype)  # torch.float16
What I would like to confirm is that the multiplications happen in fp16 but the accumulation is done in fp32. At least, that's what I expect from the autocast docs and from Train With Mixed Precision - NVIDIA Docs.
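To make concrete why I care about the accumulation dtype, here is a toy illustration I put together (just a Python running sum on CPU, not what autocast actually does internally): a long sum of small fp16 values stalls, while the same sum accumulated in fp32 does not.

```python
import torch

# Toy illustration only: summing 5,000 copies of 0.01 should give ~50,
# but a running fp16 sum stops growing once 0.01 falls below half an ulp
# of the partial sum, while an fp32 accumulator gets the right answer.
v = torch.full((5_000,), 0.01, dtype=torch.float16)

acc16 = torch.tensor(0.0, dtype=torch.float16)
acc32 = torch.tensor(0.0, dtype=torch.float32)
for e in v:
    acc16 = acc16 + e          # accumulate in fp16
    acc32 = acc32 + e.float()  # accumulate in fp32

print(acc16)  # stalls well below 50
print(acc32)  # ~50 (up to the fp16 rounding of 0.01 itself)
```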
However, when I run code like the nn.Linear snippet above, I can't find any convincing evidence that the accumulation happens in fp32. The output dtype printed above is fp16, and when I profile similar code with nsys I see the kernel ampere_fp16_s1688gemm_fp16_128x128_ldg8_f2f_stages_32x1_tn, which, at least judging by its name, doesn't mention fp32 (and the following operations take an fp16 input without any conversion in between).
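For reference, here is a minimal sketch of how the kernel names could also be captured from within PyTorch using torch.profiler instead of nsys (the tensor sizes here are just placeholders I picked for illustration):

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

x = torch.randn(1024, 1024, device="cuda")
m = nn.Linear(1024, 1024).to("cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        y = m(x)
    torch.cuda.synchronize()

# the CUDA kernel names (e.g. the ampere_fp16_... gemm) show up in this table
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```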
Maybe ampere_fp16_s1688gemm_fp16_128x128_ldg8_f2f_stages_32x1_tn is internally accumulating in fp32 and then converting the result to fp16? How can I debug this further? I'd like to avoid a matmul that accumulates in fp16, because I expect that to cause numerical stability issues, and it's not what is advertised.
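One numerical check I'm considering (I'd appreciate confirmation that it's a valid way to tell the two apart): compare the autocast output against an fp32 reference with a large reduction dimension, where fp16 accumulation should show a much larger error, and also toggle torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction, which, if I'm reading the CUDA semantics docs correctly, controls whether fp16 GEMMs are allowed to use reduced-precision reductions. A rough sketch:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
K = 16384  # large reduction dimension so accumulation error would be visible
x = torch.randn(64, K, device="cuda")
m = nn.Linear(K, 64).to("cuda")

ref = m(x)  # plain fp32 matmul as the reference

with torch.autocast(device_type="cuda", dtype=torch.float16):
    y = m(x)

# If the accumulation is really fp32, I'd expect this error to stay small
# (dominated by the fp16 rounding of the inputs); with fp16 accumulation
# I'd expect it to grow noticeably with K.
print((y.float() - ref).abs().max())

# Toggling this (default True) and re-profiling might also change which
# cuBLAS kernel gets selected, which could be another hint.
torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False
```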
I am doing this profiling on NVIDIA L4 Tensor Core GPUs (Amazon g6 instances), and I am using torch version 2.6.0+cu124.