Problem with float precision

I have several models, and I want to aggregate their parameters.
I want to compare two methods: in one of them, I multiply the model parameters by a coefficient and sum them, and in the other I use torch.mean to aggregate.

I have trouble getting comparable results, and I think it is due to the float32 precision of the model parameters.

e.g. if I have ten models whose parameters I want to aggregate, I can simply multiply them by 0.1 and then use torch.sum, and I would expect the same result as torch.mean; yet I am not getting the same results. This is my simplest scenario, but I have the same problem with the more complex operations I want to compare.
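For example, a minimal sketch of what I mean (the shapes here are made up):

import torch

# ten "models", here just random float32 tensors standing in for parameters
params = torch.randn(10, 1000)

mean_agg = params.mean(dim=0)             # aggregate with torch.mean
sum_agg = (params * 0.1).sum(dim=0)       # multiply by 0.1, then torch.sum

print(torch.equal(mean_agg, sum_agg))     # typically False in float32
print((mean_agg - sum_agg).abs().max())   # tiny, on the order of 1e-7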

I tried converting the operands, which are in float32, to float64, doing the operation, and then converting the result back to float32, but I get a segmentation fault.
e.g. (torch.DoubleTensor(a) * torch.DoubleTensor(b)).float()
Do you have any suggestions?

How large is the difference using float32?

Could you get a gdb backtrace for the segmentation fault?
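For example, something like this (assuming your script is called myscript.py):

gdb python
(gdb) run myscript.py
... wait for the segfault ...
(gdb) bt

The bt at the crash point should print the backtrace.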

The difference is small, but it diverges further in each round, and the final results are completely different, because it affects all model parameters and they converge to different states. Therefore, the comparison of my methods is not reliable in other settings.
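As a toy illustration of the compounding (the shapes, noise scale, and round count are made up), feeding each aggregation result back into the next round:

import torch

torch.manual_seed(0)

# both paths start from identical float32 parameters
p_mean = torch.randn(1000)
p_sum = p_mean.clone()

for _ in range(1000):
    updates = torch.randn(10, 1000) * 1e-3        # stand-in for per-model updates
    p_mean = (p_mean + updates).mean(dim=0)       # aggregate via torch.mean
    p_sum = ((p_sum + updates) * 0.1).sum(dim=0)  # aggregate via scale-and-sum

# the per-round rounding differences accumulate over the rounds
print((p_mean - p_sum).abs().max())

In my real setting there is a full training step between rounds, so the drift is amplified much more.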

This is my first time with gdb, and I ran:

gdb 
> file python
> run myscript.py

If I should run it differently, please let me know, but this is the output I got:

[New Thread 0x7fff55fff700 (LWP 18812)]
[New Thread 0x7fff557fe700 (LWP 18813)]
[New Thread 0x7fff54ffd700 (LWP 18814)]

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
__strlen_avx2 () at ../sysdeps/x86_64/multiarch/strlen-avx2.S:62
62      ../sysdeps/x86_64/multiarch/strlen-avx2.S: No such file or directory.

Hello Yegane (and @ptrblck)!

To add some pytorch version 0.3.0 color to this discussion, I can’t get this “use constructor to cast” approach to work either. (But I get a more informative pytorch error, rather than a segmentation fault.)

(Note that, in terms of your original problem, I can use .double(), i.e., (a.double() * b.double()).float() works for me.)

Contrary to the (quite limited) documentation, I cannot construct a DoubleTensor from a FloatTensor, nor vice versa:

import torch
torch.__version__

tf = torch.FloatTensor (2, 3)
type (tf)
td = torch.DoubleTensor (3, 4)
type (td)

torch.DoubleTensor (tf)
torch.FloatTensor (td)

tfd = tf.double()
type (tfd)
tdf = td.float()
type (tdf)
>>> import torch
>>> torch.__version__
'0.3.0b0+591e73e'
>>>
>>> tf = torch.FloatTensor (2, 3)
>>> type (tf)
<class 'torch.FloatTensor'>
>>> td = torch.DoubleTensor (3, 4)
>>> type (td)
<class 'torch.DoubleTensor'>
>>>
>>> torch.DoubleTensor (tf)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: torch.DoubleTensor constructor received an invalid combination of arguments - got (torch.FloatTensor), but expected one of:
 * no arguments
 * (int ...)
      didn't match because some of the arguments have invalid types: (torch.FloatTensor)
 * (torch.DoubleTensor viewed_tensor)
      didn't match because some of the arguments have invalid types: (torch.FloatTensor)
 * (torch.Size size)
      didn't match because some of the arguments have invalid types: (torch.FloatTensor)
 * (torch.DoubleStorage data)
      didn't match because some of the arguments have invalid types: (torch.FloatTensor)
 * (Sequence data)
      didn't match because some of the arguments have invalid types: (torch.FloatTensor)

>>> torch.FloatTensor (td)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: torch.FloatTensor constructor received an invalid combination of arguments - got (torch.DoubleTensor), but expected one of:
 * no arguments
 * (int ...)
      didn't match because some of the arguments have invalid types: (torch.DoubleTensor)
 * (torch.FloatTensor viewed_tensor)
      didn't match because some of the arguments have invalid types: (torch.DoubleTensor)
 * (torch.Size size)
      didn't match because some of the arguments have invalid types: (torch.DoubleTensor)
 * (torch.FloatStorage data)
      didn't match because some of the arguments have invalid types: (torch.DoubleTensor)
 * (Sequence data)
      didn't match because some of the arguments have invalid types: (torch.DoubleTensor)

>>>
>>> tfd = tf.double()
>>> type (tfd)
<class 'torch.DoubleTensor'>
>>> tdf = td.float()
>>> type (tdf)
<class 'torch.FloatTensor'>

Best.

K. Frank


Thanks for the addition, @KFrank!
The segfault might be related to this issue and should be fixed per this PR.

You could try to force a specific order of operations via loops, which might be slower but would probably give you exactly the same results. Using float64 might also be a good idea; I would try @KFrank’s suggestion for converting the tensors.
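For example, a minimal sketch of such a loop (assuming states is a list of state_dicts with identical keys):

import torch

def aggregate(states):
    agg = {}
    for key in states[0]:
        acc = states[0][key].double().clone()
        for state in states[1:]:             # fixed, explicit summation order
            acc = acc + state[key].double()  # accumulate in float64
        agg[key] = (acc / len(states)).float()
    return agg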
