If I run:

```python
import torch

if __name__ == '__main__':
    k = 2
    device = 'cuda'
    for m in [4096]:
        for n in [4096]:
            d = torch.zeros((m, n), dtype=torch.float16, device=device)
            x = torch.zeros((n, k), dtype=torch.float16, device=device)
            y = torch.zeros((m, k), dtype=x.dtype, device=x.device).contiguous()
            torch.matmul(d, x, out=y)
```
under the profiler

```
ncu python3 demo.py
```

I can see that the underlying kernel that gets called is:

```
cutlass_80_wmma_tensorop_f16_s161616gemm_f16_32x32_128x1_nn_align2
```
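As a cross-check (a minimal sketch I added for illustration, assuming a build where `torch.profiler` can trace CUDA activity), the launched kernel name can also be inspected from within Python, without ncu:

```python
import torch
from torch.profiler import profile, ProfilerActivity

if __name__ == '__main__':
    m, n, k = 4096, 4096, 2
    device = 'cuda'
    d = torch.zeros((m, n), dtype=torch.float16, device=device)
    x = torch.zeros((n, k), dtype=torch.float16, device=device)
    y = torch.zeros((m, k), dtype=x.dtype, device=x.device)

    # Record device-side activity only, so the table lists launched kernels.
    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        torch.matmul(d, x, out=y)

    # The CUTLASS kernel should show up in the "Name" column.
    print(prof.key_averages().table(sort_by="cuda_time_total"))
```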
If I modify demo.py in the following way:

```python
import torch

torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False

if __name__ == '__main__':
    k = 2
    device = 'cuda'
    for m in [4096]:
        for n in [4096]:
            d = torch.zeros((m, n), dtype=torch.float16, device=device)
            x = torch.zeros((n, k), dtype=torch.float16, device=device)
            y = torch.zeros((m, k), dtype=x.dtype, device=x.device).contiguous()
            torch.matmul(d, x, out=y)
```
and run the profiler again, the same kernel,

```
cutlass_80_wmma_tensorop_f16_s161616gemm_f16_32x32_128x1_nn_align2
```

gets called again.
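For what it's worth, here is a numeric probe one could run alongside the profiler (my own sketch, not anything from the docs): toggle the flag and compare the outputs bit for bit. Identical results would be consistent with the same kernel being dispatched either way.

```python
import torch

def run_matmul():
    # Same shapes as demo.py, but random inputs so the accumulation path matters.
    torch.manual_seed(0)
    m, n, k = 4096, 4096, 2
    d = torch.randn((m, n), dtype=torch.float16, device='cuda')
    x = torch.randn((n, k), dtype=torch.float16, device='cuda')
    return d @ x

torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = True
out_reduced = run_matmul()

torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False
out_full = run_matmul()

# If the flag changed the accumulation precision, the results could differ;
# bitwise-identical outputs match the observation that the same kernel runs.
print(torch.equal(out_reduced, out_full))
```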
However, the documentation seems to imply (I could be wrong about this) that `torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = False` and `setAllowFP16ReductionCuBLAS(false)` are equivalent, which is the source of my confusion.
Is libtorch perhaps missing a `setAllowFP16ReductionCutlass()`?