Unexpected output when extending nn.Linear with new output units

Hi, I am trying to extend an existing, trained nn.Linear() layer with new output units, because I want to train it on new classes as well. The idea is simply to append output units to the existing ones. However, with my current implementation (see below), the output for the existing classes is not exactly the same as from the initial model, even though the weights and biases are copied over from before.
When extending the layer from 10 to 20 classes, only classes 9 and 10 are ever affected, never any of the other old classes 1 to 8.
The difference between the outputs is very small, on the order of 1e-8, but still unexpected.
Is this just floating-point precision error, or is my approach not working as intended?

import torch

in_channels = 64
n_classes_a = 10
n_classes_b = 20

a_layer = torch.nn.Linear(in_features=in_channels, out_features=n_classes_a, bias=True)
b_layer = torch.nn.Linear(in_features=in_channels, out_features=n_classes_b, bias=True)

# copy weight and bias of the old classes into the first rows of the new layer
new_state_dict = b_layer.state_dict()
old_state_dict = a_layer.state_dict()
new_state_dict["weight"][:n_classes_a] = old_state_dict["weight"]
new_state_dict["bias"][:n_classes_a] = old_state_dict["bias"]
b_layer.load_state_dict(new_state_dict)

# compare the old layer's output with the first n_classes_a outputs of the new layer
random_input = torch.randn(2, in_channels)
out_a = a_layer(random_input)
out_b = b_layer(random_input)
print(out_b[:, :n_classes_a] - out_a)

Example output:

tensor([[ 0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00, -5.9605e-08],
        [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  8.9407e-08,  1.4901e-08]],
       grad_fn=<SubBackward0>)

Any help is really appreciated 🙂

Hi Tobias!

As you suspect, this is very likely floating-point round-off error.
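
One quick sanity check (a sketch, reusing the names from your snippet): confirm that the copied rows are bit-identical to the original layer's parameters, so the copy itself cannot be the source of the difference.

# both prints should show True if the rows were copied exactly
print(torch.equal(b_layer.weight[:n_classes_a], a_layer.weight))
print(torch.equal(b_layer.bias[:n_classes_a], a_layer.bias))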

As the size of a tensor changes, PyTorch can change the details of
how it pipelines tensor operations, both on the CPU and the GPU. This
can cause operations to be reordered in a way that is mathematically
equivalent but produces different round-off errors (floating-point
addition is not associative).
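
As a small, self-contained illustration of the underlying effect (this is not the exact reduction your Linear layer dispatches to, just the general phenomenon): summing the same float32 values in a different order usually gives a slightly different result.

import torch

torch.manual_seed(0)
x = torch.randn(100_000)                  # float32 by default

s_blocked = x.sum()                       # blocked / vectorized reduction
s_sequential = x.cumsum(0)[-1]            # strictly left-to-right summation of the same values
print((s_blocked - s_sequential).item())  # usually a small non-zero difference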

You could check this by repeating your experiment in double-precision.
I would then expect the discrepancy to be reduced by several orders of
magnitude.
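
A minimal sketch of that check, reusing the layers and input from your snippet (note that .double() converts a module in place):

a_layer.double()
b_layer.double()
input_fp64 = random_input.double()

diff = b_layer(input_fp64)[:, :n_classes_a] - a_layer(input_fp64)
print(diff.abs().max().item())   # expect something on the order of 1e-16 or smaller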

Note that the details of when and how the pipelining changes with
tensor size will differ between the single-precision and double-precision
cases, so you might have to play around with the tensor sizes to see
a similar effect.

Best.

K. Frank


Thank you very much for the answer and explanation!
I just tested it with double precision, and indeed the error is much smaller now, only in the range of 1e-17.
So this is just a negligible precision error and not a bug in my code.
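
In case someone finds this later: instead of eyeballing the printed difference, torch.allclose with tolerances around float32 round-off confirms that the outputs match (a quick sketch using the variables from my snippet above):

# True, since the discrepancy is only on the order of 1e-8
print(torch.allclose(out_b[:, :n_classes_a], out_a, rtol=1e-5, atol=1e-7))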