Hi, I’ve got a high absolute error in `torch.matmul`. Here is my example code.
```python
import torch
from torch import tensor


def main():
    torch.cuda.manual_seed(42)
    mask = tensor([[0., 0., 0.],
                   [1., 0., 0.]], device='cuda')
    matrix = torch.randn(3, 3, device='cuda')
    print(matrix)
    res = torch.matmul(mask, matrix)
    print(res)


if __name__ == "__main__":
    main()
```
And the result is
```
tensor([[ 0.1940,  2.1614, -0.1721],
        [ 0.8491, -1.9244,  0.6530],
        [-0.6494, -0.8175,  0.5280]], device='cuda:0')
tensor([[ 0.0000,  0.0000,  0.0000],
        [ 0.1940,  2.1621, -0.1720]], device='cuda:0')
```
The second row of `res` should be an exact copy of the first row of `matrix`, but I get a noticeable absolute error between 2.1614 and 2.1621. When I move all the tensors to the CPU, the results match exactly. Is this a problem in PyTorch or the CUDA backend, and how can I fix it?
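For what it's worth, the size of the discrepancy looks consistent with TF32 rounding (10 mantissa bits instead of float32's 23), which I understand Ampere GPUs can use for matmuls. Below is a rough pure-Python sketch; `to_tf32` is my own helper that just simulates the reduced mantissa by bit masking, it is not anything from PyTorch, and it uses round-half-up rather than the hardware's exact rounding mode:

```python
import struct


def to_tf32(x: float) -> float:
    """Simulate rounding a float32 value to TF32 precision (10 mantissa bits).

    Rough sketch only: uses round-half-up and ignores NaN/inf corner cases.
    """
    # Reinterpret the value as raw float32 bits.
    (bits,) = struct.unpack('<I', struct.pack('<f', x))
    # Add half of the TF32 ulp in bit space (2**12), then clear the
    # low 13 mantissa bits (23 - 10 = 13) to truncate to 10 bits.
    bits = (bits + 0x1000) & ~0x1FFF
    (y,) = struct.unpack('<f', struct.pack('<I', bits))
    return y


print(to_tf32(2.1614))  # → 2.162109375
```

Rounding 2.1614 to 10 mantissa bits gives about 2.1621, which is exactly the value I see in the GPU result, so the error magnitude at least matches this explanation.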
My PyTorch version is 1.10.1+cu113 and my CUDA toolkit version is 11.5.
As for my running environment: I first hit this issue in WSL2, where I can't determine the actual CUDA toolkit version because PyTorch just uses the CUDA libraries provided by the Windows driver. Installing a specific toolkit version such as 11.3 through apt makes CUDA unavailable to PyTorch, so I can't use nvcc to check the version inside WSL2. I then reproduced the problem on my server, which has an A100 GPU, running the code in the NGC PyTorch 21.11 container; the server's CUDA toolkit version is 11.5. I still get the same result.
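One thing I plan to try (assuming the A100's TF32 matmul path is responsible, which I haven't confirmed) is disabling TF32 before running the matmul:

```python
import torch

# TF32 is enabled by default for matmuls on Ampere GPUs in recent PyTorch
# versions; turning it off forces full float32 precision (at some speed cost).
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False  # same switch for cuDNN ops
```

If the results then match the CPU, that would confirm the discrepancy is TF32 rounding rather than a bug.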