Hello,
The following is a minimal working example of the problem I have come across:
import os
import random
import numpy as np
import torch
import torch.nn as nn
# cuBLAS needs this setting for deterministic behavior; set it before any CUDA work.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
torch.use_deterministic_algorithms(True)
# Seed every random number generator in use.
seed = 0
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
# Disable cuDNN autotuning and request deterministic cuDNN kernels.
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
x = torch.randn(28, device = "cuda", dtype=torch.float)
y = torch.randn(28, device = "cuda", dtype=torch.float)
my_dot = torch.dot(x, y)/torch.linalg.norm(y)
cos = nn.CosineSimilarity(dim = 0, eps = 0)
cos_dot = torch.linalg.norm(x) * cos(x,y)
print(my_dot.item())
print(cos_dot.item())
The output of this snippet on my system is:
0.15492278337478638
0.15492276847362518
They differ in the later decimal places, but ideally both should be identical.
When I cast x and y to double by replacing the declarations above with the following lines:
x = torch.randn(28, device = "cuda", dtype=torch.float).double()
y = torch.randn(28, device = "cuda", dtype=torch.float).double()
I get the following outputs, which are identical, as expected.
0.15492288182677755
0.15492288182677755
Why are they the same in double precision but different in float precision?
Thanks!
The difference is ~1e-8 and is expected for float32 due to the limited floating point precision and a potentially different order of operations.
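To make the "order of operations" point concrete, here is a minimal and deliberately exaggerated sketch. It uses only the Python standard library and emulates float32 rounding with `struct`, so it runs without PyTorch or a GPU. The helpers `f32` and `add32` are hypothetical names introduced for this illustration and are not part of any library. The effect in the snippet above is the same in kind, just far smaller.

```python
import struct

def f32(value: float) -> float:
    """Round a Python float (a float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", value))[0]

def add32(a: float, b: float) -> float:
    """Emulate a float32 addition: add, then round the result to float32."""
    return f32(a + b)

a, b, c = f32(1e8), f32(1.0), f32(-1e8)

# Mathematically both expressions equal 1, but float32 cannot hold 1e8 + 1.
left_to_right = add32(add32(a, b), c)  # (a + b) + c
reordered = add32(add32(a, c), b)      # (a + c) + b

print(left_to_right)  # 0.0
print(reordered)      # 1.0
```

Different kernels (for example a dot product versus a cosine-similarity implementation) may accumulate and scale terms in a different order, so small discrepancies of this kind are to be expected.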
Thanks for the explanation, I understand. However, I have a follow-up question. When I instantiate x and y directly as double, as shown below, I still get different answers for the cosine similarity calculated in the two different ways. The code is the following; the seeds are set the same way as above:
x = torch.randn(28, device = "cuda", dtype=torch.double)
y = torch.randn(28, device = "cuda", dtype=torch.double)
my_dot = torch.dot(x, y)/torch.linalg.norm(y)
cos = nn.CosineSimilarity(dim = 0, eps = 0)
cos_dot = torch.linalg.norm(x) * cos(x,y)
print(my_dot.item())
print(cos_dot.item())
The output of the above snippet is:
-0.139646650365121
-0.13964665036512092
I know that this is a small difference, but it nevertheless causes my gradients (in my original code) to be nonzero, which in turn makes my backpropagation diverge.
Please let me know if this is expected and if I am missing something.
Thanks again!
Increasing the number of bits in the numerical format gives you more precision (the new error is on the order of 1e-16), but the precision is still limited.
I would suggest checking your actual requirement (negative gradients) and perhaps applying a small eps value in your calculation. You should not expect more precision than what’s possible in the current numerical format.
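As a rough illustration of the eps suggestion, the sketch below compares the two float64 results printed above with a tolerance instead of testing for exact equality. The tolerance `1e-12` and the helper name `snap_to_zero` are assumptions chosen for this example; a suitable threshold depends on the scale of your values.

```python
import math

# The two float64 results printed above.
my_dot = -0.139646650365121
cos_dot = -0.13964665036512092

def snap_to_zero(value: float, eps: float = 1e-12) -> float:
    """Return 0.0 when |value| is below eps, otherwise return value unchanged."""
    return 0.0 if abs(value) < eps else value

print(my_dot == cos_dot)                             # False
print(math.isclose(my_dot, cos_dot, rel_tol=1e-12))  # True
print(snap_to_zero(my_dot - cos_dot))                # 0.0
```

For tensors, `torch.isclose` and `torch.allclose` offer the same kind of tolerance-based comparison.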
What exactly does numerical format mean in the context of PyTorch?
PyTorch uses float32 (i.e. floating point numbers stored in 32 bits) as its default and also allows users to use wider types with more bits (and thus more range and precision) such as float64, as well as smaller types such as float16 or bfloat16.
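For a feel of how much precision each of these formats carries, the sketch below derives the machine epsilon (the gap between 1.0 and the next representable value) from the number of explicitly stored significand bits in each format. It is plain Python; the dictionary `SIGNIFICAND_BITS` and the function `machine_eps` are names made up for this example.

```python
import math
import sys

# Explicitly stored significand (mantissa) bits per format.
SIGNIFICAND_BITS = {"bfloat16": 7, "float16": 10, "float32": 23, "float64": 52}

def machine_eps(fmt: str) -> float:
    """Gap between 1.0 and the next larger representable value in fmt."""
    return 2.0 ** -SIGNIFICAND_BITS[fmt]

for fmt in SIGNIFICAND_BITS:
    eps = machine_eps(fmt)
    digits = -math.log10(eps)  # rough count of reliable decimal digits
    print(f"{fmt:>8}: eps = {eps:.3e}  (~{digits:.1f} decimal digits)")

# Python's own float is a float64, so the two values should agree.
assert machine_eps("float64") == sys.float_info.epsilon
```

In PyTorch, `torch.finfo(torch.float32).eps` reports the same quantity for a given dtype.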
For more general information about this format and its precision limitations, take a look at this Wikipedia article about float32, which is also called “single-precision” float.
The “precision” section might be interesting for you, and you could experiment with the rounding behavior of this numerical format that it describes.
E.g., the statement “Precision limitations on integer values - Integers between 2**24 and 2**25 round to a multiple of 2 (even number)” can be seen as:
x = torch.tensor(2**24, dtype=torch.float32)
print(x)
# tensor(16777216.)
print(x + 1)
# tensor(16777216.)
print(x + 2)
# tensor(16777218.)
As you can see, 16777217 is not representable in float32, because the spacing between adjacent representable values grows as the magnitude of the number increases.
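The same limit exists in float64, just at a larger magnitude. A minimal sketch using Python's built-in `float` (which is a float64), where the threshold sits at 2**53 instead of 2**24:

```python
x = float(2**53)

print(x)           # 9007199254740992.0
print(x + 1 == x)  # True: 2**53 + 1 is not representable and rounds back to 2**53
print(x + 2)       # 9007199254740994.0
```

This is why switching to double shrinks the discrepancy but does not eliminate it.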
The round-off errors you are seeing are explained e.g. in this article with a few examples.