Why does adding a bias give numerically inconsistent results?

I have noticed a disturbing pattern with PyTorch (and other libraries too) but can't get to the bottom of it!
Any operation that involves a bias term gives slightly different (wrong?) results! I will illustrate with the Conv2d module here.

Without the bias term:

import torch
import numpy as np
torch.set_printoptions(precision=32)
np.random.seed(23)

# Creating random datapoints of Batch size=5
x_data = np.random.random(size=(5,1,1,1)).astype('float32')
x = torch.as_tensor(x_data)

# Defining Network
net = torch.nn.Conv2d(in_channels=1,out_channels=1,kernel_size=1)

# Initializing random weights and zero biases
weight = np.random.random(size=(1,1,1,1)).astype('float32')
bias = np.zeros(shape=(1,)).astype('float32')

from collections import OrderedDict
parameters = OrderedDict()

parameters['weight'] = torch.as_tensor(weight)
parameters['bias'] = torch.as_tensor(bias)

# Loading the network parameters with custom W & B
net.load_state_dict(parameters)

output1 = net(x)


With the bias term:

import torch
import numpy as np
torch.set_printoptions(precision=32)
np.random.seed(23)

# You can reinitialize x here or use the same x from before. Won't make a difference!
x_data = np.random.random(size=(5,1,1,1)).astype('float32')
x = torch.as_tensor(x_data)

net = torch.nn.Conv2d(in_channels=1,out_channels=1,kernel_size=1)

# Here too you can use the same weights as before; the bias is now randomly initialized!
weight = np.random.random(size=(1,1,1,1)).astype('float32')
bias = np.random.random(size=(1,)).astype('float32')

from collections import OrderedDict
parameters = OrderedDict()

parameters['weight'] = torch.as_tensor(weight)
parameters['bias'] = torch.as_tensor(bias)

net.load_state_dict(parameters)

output2 = net(x)

# You can even print the actual values and notice the difference!
print(output2-output1 == torch.as_tensor(bias))

My outputs are:

tensor([[[[ True]]],
        [[[False]]],
        [[[False]]],
        [[[ True]]],
        [[[False]]]])

This True/False sequence is essentially random: choose a different seed or batch size and you will get a different pattern.

Although the error introduced by the bias term is very small, it still exists, and not knowing the reason is very frustrating. If you know the reason (and/or a workaround to avoid it), please help! Thanks!

Note: I have observed the same randomness in other modules like BatchNorm and in other libraries like PaddlePaddle. I am running this on the CPU.

If you print the difference:
print(output2-output1 - torch.as_tensor(bias))
you see that it is 1e-8ish, so this is an effect of numerical precision.
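
If you need such a check in practice, a workaround is to compare within a small tolerance instead of testing for exact equality, for example with torch.allclose (a minimal sketch, reusing output1, output2, and bias from the snippets above):

# The element-wise exact comparison fails because of rounding,
# but an approximate comparison within a small tolerance passes.
diff = output2 - output1
print(torch.allclose(diff, torch.as_tensor(bias), atol=1e-6))  # True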

With floating-point numbers, you cannot expect two "algebraically equivalent" computations (i.e. the maths says they should be the same, but they are calculated differently) to give exactly the same result; the rounding errors in the computation make them slightly different.

This gets even more tricky when parallel computation is involved, which will not necessarily even compute the same result when run twice, because the order of operations can be "random" (look for reproducible/deterministic in the PyTorch documentation; the CUDA atomicAdd instruction with floating point is perhaps the most common source of this in PyTorch).
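
As a side note, a minimal sketch of those reproducibility settings (assuming a reasonably recent PyTorch; this makes repeated runs of the same code agree with each other, it does not make algebraically equivalent code paths bitwise identical):

import torch

torch.manual_seed(0)                      # fix the RNG used for parameter initialization etc.
torch.use_deterministic_algorithms(True)  # raise an error on known non-deterministic ops
# On the GPU, cuBLAS may additionally need CUBLAS_WORKSPACE_CONFIG=":4096:8"
# set in the environment (see the PyTorch reproducibility notes).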

Perhaps the simplest example is that 1e32 + 1 - 1e32 is not the same as 1e32 - 1e32 + 1.
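
For example, in plain Python (the same effect shows up with float32 tensors):

# Floating-point addition is not associative: the 1 is lost
# when it is added to the huge number first.
print(1e32 + 1 - 1e32)  # 0.0
print(1e32 - 1e32 + 1)  # 1.0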

Best regards

Thomas


Thanks for the quick reply and the nice explanation! So there is no way of getting exactly the same result from floating-point operations. Although the error due to numerical precision is negligible here, I wonder whether it won't affect the output of big models with billions of FLOPs.

Whether or not this is a problem depends a lot on the situation:

  • Convolutions, linear layers, and other reductions "average" the errors in their inputs. If these errors are independent, they won't accumulate.
  • For other operations this is much more tricky, e.g. taking exponentials in softmax to get probabilities and then the log (e.g. for KL divergence / "cross entropy" etc.) can run into precision problems, in particular with very small probabilities (see the sketch after this list).
  • Finally, while the ops in a single model run are often not problematic, this changes when you train models for millions or billions of iterations: over the course of a training run these errors can accumulate quite dramatically, so the differences may mean that you cannot expect to reproduce exactly the same weights at the end when re-running the same code. This is quite annoying for reproducing scientific results etc.
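
A minimal sketch of the softmax/log issue from the second bullet (the logits are made up for illustration; the fused log_softmax stays finite where the two-step version underflows):

import torch

logits = torch.tensor([0.0, 200.0])  # one class overwhelmingly more likely

# Two-step version: exp(-200) underflows to zero in float32, so the log is -inf.
print(torch.log(torch.softmax(logits, dim=0)))         # roughly [-inf, 0.]

# Fused version works in log space and stays finite.
print(torch.nn.functional.log_softmax(logits, dim=0))  # roughly [-200., 0.]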