Bug? matmul seems to cast to float16 internally

What's going on here? Any comments would be appreciated.

On Win10, [[1.0]] multiplied by [[1.0001]] incorrectly results in 1., whereas [1.0] multiplied by [1.0001] correctly results in 1.0001.

It seems as if the internal accumulator is only float16. It works with float64, or without using CUDA.

I cannot reproduce this on an Ubuntu machine.

Code
import torch

dtype = torch.float32

# 1x1 matrix times 1x1 matrix (matrix-matrix product)
A = torch.tensor([[1.]], dtype=dtype).cuda()
B = torch.tensor([[1.0001]], dtype=dtype).cuda()
test1 = torch.matmul(A, B)

# length-1 vector dot product for comparison
A = torch.tensor([1.], dtype=dtype).cuda()
B = torch.tensor([1.0001], dtype=dtype).cuda()
test2 = torch.matmul(A, B)

print(test1)
print(test2)

print(torch.version.cuda)
print(torch.__version__)

Output
tensor([[1.]], device='cuda:0')
tensor(1.0001, device='cuda:0')
11.3
1.10.0+cu113

It’s running on an RTX 3080.
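
For completeness, a sketch of the cross-checks mentioned above (float64 on CUDA, and float32 on the CPU); both give the expected 1.0001 here:

import torch

# Same product in float64 on the GPU: works as expected
A64 = torch.tensor([[1.]], dtype=torch.float64).cuda()
B64 = torch.tensor([[1.0001]], dtype=torch.float64).cuda()
print(torch.matmul(A64, B64))   # tensor([[1.0001]], device='cuda:0', dtype=torch.float64)

# Same product in float32 on the CPU: also works as expected
A_cpu = torch.tensor([[1.]], dtype=torch.float32)
B_cpu = torch.tensor([[1.0001]], dtype=torch.float32)
print(torch.matmul(A_cpu, B_cpu))  # tensor([[1.0001]])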

Hi Thomas!

FYI, I also cannot reproduce this on Ubuntu using a "GeForce GTX 1050 Ti".

>>> import torch
>>> dtype = torch.float32
>>> A = torch.tensor([[1.]], dtype=dtype).cuda()
>>> B = torch.tensor([[1.0001]], dtype=dtype).cuda()
>>> test1 = torch.matmul(A, B)
>>> A = torch.tensor([1.], dtype=dtype).cuda()
>>> B = torch.tensor([1.0001], dtype=dtype).cuda()
>>> test2 = torch.matmul(A, B)
>>> print(test1)
tensor([[1.0001]], device='cuda:0')
>>> print(test2)
tensor(1.0001, device='cuda:0')
>>> print(torch.version.cuda)
10.2
>>> print(torch.__version__)
1.9.0
>>> print (torch.cuda.get_device_name (0))
GeForce GTX 1050 Ti
$ uname -a
Linux server 5.11.0-38-generic #42~20.04.1-Ubuntu SMP Tue Sep 28 20:41:07 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

Best.

K. Frank

I can confirm this is happening for me, too.

Workaround: torch.backends.cuda.matmul.allow_tf32 = False, but it doesn't feel like a good solution. :frowning:
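
For context, a minimal sketch of the workaround applied to the repro above (the flag disables TF32 for float32 matmuls globally):

import torch

# Disable TF32 so float32 matmul inputs are multiplied at full float32 precision
torch.backends.cuda.matmul.allow_tf32 = False

A = torch.tensor([[1.]], dtype=torch.float32).cuda()
B = torch.tensor([[1.0001]], dtype=torch.float32).cuda()
print(torch.matmul(A, B))  # expected: tensor([[1.0001]], device='cuda:0')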

Best regards

Thomas

So I ran this by people who know more than me (i.e. everyone else), and @ngimel kindly explained that this is the expected behaviour for tf32 arithmetic, which only uses 10 bits of mantissa.
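
To make that concrete, here is a small sketch (plain Python, no GPU needed) that truncates a float32 value to TF32's 10 explicit mantissa bits; real hardware may round rather than truncate, but the effect near 1.0 is the same:

import struct

def to_tf32(x):
    # Zero out the 13 low mantissa bits of the float32 representation,
    # keeping the 10 explicit mantissa bits that TF32 retains.
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    bits &= ~((1 << 13) - 1)
    return struct.unpack('<f', struct.pack('<I', bits))[0]

print(to_tf32(1.0001))  # 1.0 -- the 1e-4 is below the ~2**-10 resolution near 1.0
print(to_tf32(1.001))   # 1.0009765625 -- snapped to the 2**-10 grid near 1.0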

There is an issue that discusses the loss of precision here: The matrix multiplication operator can't get correct results on 3090 !! · Issue #61890 · pytorch/pytorch · GitHub

Best regards

Thomas

Excellent answer, thank you very much for taking the time. The workaround you posted solved everything, and it explains what is going on.

I think it's false advertising that someone named the format tf32, though; with 1 sign bit, 8 exponent bits, and 10 explicit mantissa bits, it should really be tf19.

Best regards,
Thomas

Hi Thomas,

in the meantime Natalia opened an RFC issue: RFC: Should matmuls use tf32 by default? · Issue #67384 · pytorch/pytorch · GitHub.

I want to add: thanks for bringing it up! It's important that users write about the issues they see, so that the PyTorch developers get a better idea of how intuitive default behaviour like this is and where to strike the balance between "maximizing performance" and "avoiding surprises".

Best regards

Thomas