Hi, I’m seeing a high absolute error in torch.matmul. Here is my example code.
import torch
from torch import tensor

def main():
    torch.cuda.manual_seed(42)
    mask = tensor([[0., 0., 0.], [1.0, 0., 0.]], device='cuda')
    matrix = torch.randn(3, 3, device='cuda')
    print(matrix)
    res = torch.matmul(mask, matrix)
    print(res)

if __name__ == "__main__":
    main()
And the result is:

tensor([[ 0.1940,  2.1614, -0.1721],
        [ 0.8491, -1.9244,  0.6530],
        [-0.6494, -0.8175,  0.5280]], device='cuda:0')
tensor([[ 0.0000,  0.0000,  0.0000],
        [ 0.1940,  2.1621, -0.1720]], device='cuda:0')
The entries 2.1614 and 2.1621 differ, but they should be identical: since the second row of mask just selects the first row of matrix, the second row of the result should be an exact copy of it. When I move all tensors to the CPU, the result is correct. Is this a problem in PyTorch or in the CUDA backend, and how can I fix it?
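To quantify the mismatch, here is a small sketch of the check I mean: it runs the same matmul on the GPU and on the CPU and prints the maximum absolute difference (it falls back to CPU when no GPU is present, in which case the difference is trivially zero):

```python
import torch

# Compare the same masked matmul on the available device vs. the CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
torch.manual_seed(42)

mask = torch.tensor([[0., 0., 0.], [1., 0., 0.]], device=device)
matrix = torch.randn(3, 3, device=device)

res_dev = torch.matmul(mask, matrix).cpu()
res_cpu = torch.matmul(mask.cpu(), matrix.cpu())

# Maximum absolute error between the two results.
err = (res_dev - res_cpu).abs().max().item()
print(err)
```

On my machine this reports an error on the order of 1e-3 for the entry shown above, far larger than float32 rounding would explain.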
My PyTorch version is 1.10.1+cu113 and my CUDA toolkit version is 11.5.
PS:
As for my environment: I first hit this issue in WSL2, where I can’t determine the actual CUDA toolkit version because I rely on the CUDA libraries provided by the Windows driver. Installing a specific toolkit version such as 11.3 through apt makes CUDA unavailable to PyTorch, so I can’t use nvcc to check the toolkit version in WSL2. I then reproduced the problem on my server, which has an A100 GPU, running the code in the NGC PyTorch 21.11 container; the server’s CUDA toolkit version is 11.5. I still got the same problem there.
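Since nvcc is unavailable in WSL2, the version information below is queried from PyTorch itself. Note that torch.version.cuda reports the toolkit PyTorch was built against, not necessarily the runtime provided by the driver:

```python
import torch

# Version/runtime info available without nvcc.
print(torch.__version__)   # PyTorch version, e.g. 1.10.1+cu113
print(torch.version.cuda)  # CUDA toolkit PyTorch was compiled with (or None)

if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
    # TF32 matmul mode; defaults to enabled on Ampere GPUs in PyTorch 1.10.
    print(torch.backends.cuda.matmul.allow_tf32)
```

I mention the allow_tf32 flag only because the A100 supports TF32; whether it is actually the cause here is exactly what I’m asking about.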