Setting allow_fp16_reduced_precision_reduction via libtorch

The documentation (CUDA semantics — PyTorch 2.5 documentation) implies that, in order to set allow_fp16_reduced_precision_reduction to false via C++/libtorch, one must use setAllowFP16ReductionCuBLAS.
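
For concreteness, the C++-side call I am referring to would look roughly like this (a sketch; I am assuming the setter is the one exposed through at::globalContext(), mirroring the Python flag):

#include <torch/torch.h>
#include <iostream>

int main() {
  // C++ counterpart of
  // torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False
  at::globalContext().setAllowFP16ReductionCuBLAS(false);

  // An arbitrary fp16 GEMM on the GPU.
  auto opts = torch::TensorOptions().dtype(torch::kFloat16).device(torch::kCUDA);
  auto d = torch::zeros({4096, 4096}, opts);
  auto x = torch::zeros({4096, 2}, opts);
  auto y = torch::matmul(d, x);
  std::cout << y.sizes() << std::endl;
  return 0;
}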

Since calling matmul invokes CUTLASS, does this mean that disabling the fp16 reduction is simply not possible via the C++ API?

The attribute is changing the cuBLAS behavior. Could you describe how exactly you are calling matmuls in CUTLASS?

If I run:

import torch

if __name__ == '__main__':
    k = 2
    device = 'cuda'
    for m in [4096]:
        for n in [4096]:
            # fp16 GEMM: d (m, n) @ x (n, k) -> y (m, k), written into a preallocated output
            d = torch.zeros((m, n), dtype=torch.float16, device=device)
            x = torch.zeros((n, k), dtype=torch.float16, device=device)
            y = torch.zeros((m, k), dtype=x.dtype, device=x.device).contiguous()
            torch.matmul(d, x, out=y)

under the profiler

ncu python3 demo.py

I can see that the underlying kernel that gets called is:

cutlass_80_wmma_tensorop_f16_s161616gemm_f16_32x32_128x1_nn_align2

If I modify demo.py in the following way:

import torch

# Disallow reduced-precision (fp16) reductions inside fp16 GEMMs
torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False

if __name__ == '__main__':
    k = 2
    device = 'cuda'
    for m in [4096]:
        for n in [4096]:
            d = torch.zeros((m, n), dtype=torch.float16, device=device)
            x = torch.zeros((n, k), dtype=torch.float16, device=device)
            y = torch.zeros((m, k), dtype=x.dtype, device=x.device).contiguous()
            torch.matmul(d, x, out=y)

and run the profiler again

cutlass_80_wmma_tensorop_f16_s161616gemm_f16_32x32_128x1_nn_align2

gets called again.

However, the documentation seems to imply (I could be wrong on this) that torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False and setAllowFP16ReductionCuBLAS(false) are equivalent, which is the source of my confusion.

Is libtorch perhaps missing a setAllowFP16ReductionCutlass()?

Shouldn’t these flags affect the underlying kernel being called?
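
As a sanity check, I would also expect to be able to read the flag back on the C++ side (a sketch; I am assuming the matching getter allowFP16ReductionCuBLAS() on at::globalContext(), mirroring the setter):

#include <torch/torch.h>
#include <iostream>

int main() {
  at::globalContext().setAllowFP16ReductionCuBLAS(false);
  // If the Python flag and the C++ setter are the same knob,
  // this should print 0 after the call above.
  std::cout << at::globalContext().allowFP16ReductionCuBLAS() << std::endl;
  return 0;
}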

These CUTLASS kernels are called via cuBLAS. Why do you think this flag is not applied?
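
For context, my understanding is that the flag does not select a differently named kernel; it toggles a math-mode bit on the cuBLAS handle, roughly along these lines (a sketch of plain cuBLAS usage, not PyTorch's actual implementation):

#include <cublas_v2.h>

// Sketch: disabling reduced-precision reductions is a property of the cuBLAS
// handle. cuBLAS then decides internally which (possibly CUTLASS-derived)
// kernel to launch for that math mode.
void disallow_fp16_reduction(cublasHandle_t handle) {
  cublasSetMathMode(handle,
      static_cast<cublasMath_t>(CUBLAS_DEFAULT_MATH |
                                CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION));
}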

If I pass the --accum flag as either f32 or f16 to the CUTLASS profiler, the underlying kernel that gets selected also changes (the difference is usually the presence or absence of “_f16” in the kernel name).
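
Concretely, I mean invocations along these lines (the exact option spelling is from memory, so it may need adjusting against the profiler's --help; the sizes just mirror my demo):

cutlass_profiler --operation=Gemm --m=4096 --n=2 --k=4096 --A=f16:column --B=f16:column --accum=f16

cutlass_profiler --operation=Gemm --m=4096 --n=2 --k=4096 --A=f16:column --B=f16:column --accum=f32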

I may just be horribly wrong on this; I just assumed that I would also be able to observe something similar with torch.matmul.