Second initialisation of the bias parameter has no effect on the return value of forward on the GPU, but does on the CPU

While trying to implement data-dependent initialisation for weight normalisation, I've run into some confusing behaviour with parameter initialisation.
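
For context, the data-dependent part follows the scheme from Salimans & Kingma's weight-normalisation paper: run one forward pass on an initialisation batch, then set the gain and bias from the resulting pre-activation statistics. A minimal sketch of what I'm aiming for, assuming the old nn.utils.weight_norm API (the layer sizes and the init batch here are placeholders):

import torch
import torch.nn as nn

layer = nn.utils.weight_norm(nn.Linear(256, 16))  # exposes weight_g and weight_v

with torch.no_grad():
    nn.init.zeros_(layer.bias)
    x0 = torch.randn(32, 256)      # initialisation batch (placeholder)
    y0 = layer(x0)                 # pre-activations under the initial weights
    m, s = y0.mean(dim=0), y0.std(dim=0)
    # Rescale the per-output gain and set the bias so the initialisation
    # batch comes out roughly zero-mean / unit-std per output unit.
    layer.weight_g.div_(s.unsqueeze(1))
    layer.bias.copy_(-m / s)

(Recent PyTorch releases deprecate nn.utils.weight_norm in favour of nn.utils.parametrizations.weight_norm, but the idea is the same.)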

I’ve boiled it down to a minimal example. Let’s say that for a linear layer I do the following:

  1. Initialise the bias to 0.0.
  2. Run the first forward pass.
  3. Initialise the bias to 1.0.
  4. Run the second forward pass.

I would then expect the mean of the second forward pass to be exactly 1 larger than that of the first: nn.Linear computes x @ W.T + b, so setting every bias entry to 1 shifts every output element by 1, which shifts the mean by 1 and leaves the standard deviation untouched.
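
That expectation is just linearity of the mean; a two-line sanity check, independent of the repro below:

t = torch.randn(4, 4)
assert torch.allclose((t + 1).mean(), t.mean() + 1)  # the mean shifts by the constant
assert torch.allclose((t + 1).std(), t.std())        # the std is unaffected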

On the CPU this is indeed the case. On the GPU, however, I get the same output in both forward passes, regardless of the value of the bias. Inspecting the module shows that the bias parameter has been updated and is located on the GPU.

import torch
import torch.nn as nn


B = 2    # batch size
C = 2    # extra leading dimension; Linear applies to the last dim
I = 256  # in_features
O = 16   # out_features

x = torch.randn(B, C, I)

module = nn.Linear(I, O)

# On CPU, changing the bias affects the output.
# On GPU, changing the bias does not affect the output.
for device in [torch.device("cpu"), torch.device("cuda:0")]:

    print(device)

    x = x.to(device)
    module = module.to(device)

    # Set bias to 0.0
    nn.init.constant_(module.bias, 0.0)

    o = module(x)
    print(o.mean().item(), o.std().item())

    # Set bias to 1.0
    nn.init.constant_(module.bias, 1.0)

    o = module(x)
    print(o.mean().item(), o.std().item())
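
For reference, the inspection I mentioned above is just printing the parameter right after the second init; it shows the updated value on the device:

print(module.bias)  # -> all ones, device='cuda:0', requires_grad=True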

I can’t reproduce the issue; on both devices I get the expected results:

cpu
-0.06411010026931763 0.5944567918777466
0.9358899593353271 0.5944567918777466
cuda:0
-0.06415127217769623 0.5944625735282898
0.9358487129211426 0.5944625735282898
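
If you still see identical outputs on your machine, it's probably worth ruling out the environment first; note that .item() already synchronises the device, so asynchronous kernel execution can't explain a stale mean. A quick check (nothing here is a confirmed cause, just the usual suspects):

print(torch.__version__, torch.version.cuda)  # PyTorch build and the CUDA it was compiled against
print(torch.cuda.get_device_name(0))          # which GPU is actually being used
print(module.bias.device, module.bias.mean().item())  # should report cuda:0 and 1.0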