Higher variance when evaluating with manual regularization

Hello,
When using something like the following

            for param in net.parameters():
                loss += weight_decay * 0.5 * torch.pow(param, 2).sum()

instead of the weight_decay option of the optimizer, I notice, depending on the weight_decay value, different updates after some epochs compared to using the optimizer option.
These differences ultimately lead to a higher variance when evaluating the model over multiple random seeds.
Where do these differences come from, and how can I get rid of them?
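For reference, the two setups I am comparing look roughly like this (the model, loss function and hyperparameter values below are just placeholders):

import torch
import torch.nn as nn

net = nn.Linear(10, 1)   # placeholder model
weight_decay = 1e-4      # placeholder value

# Variant A: manual L2 penalty added to the loss, optimizer without weight_decay
optimizer_a = torch.optim.SGD(net.parameters(), lr=0.1)

# Variant B: the built-in weight_decay option of the optimizer
optimizer_b = torch.optim.SGD(net.parameters(), lr=0.1, weight_decay=weight_decay)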
Thanks in advance!

Hi,

Weight decay in the optimizers is computed “in closed form” by simply doing

grad += weight_decay * weight

(done here for SGD, for example).
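To make that concrete, a hand-rolled SGD step with the built-in weight_decay does roughly the following (a minimal sketch of the idea, not the actual optimizer source):

import torch

def sgd_step_with_decay(params, lr, weight_decay):
    # The decay term is folded directly into the gradient in closed form,
    # so no backward pass is needed for it.
    with torch.no_grad():
        for p in params:
            if p.grad is None:
                continue
            d_p = p.grad + weight_decay * p   # grad += weight_decay * weight
            p -= lr * d_p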

This will be more numerically stable and will have fewer floating-point errors because it avoids doing the whole backward pass for that term.

So the two are expected to differ slightly, but they should not lead to a big change in convergence unless your net is big enough that the sum of the squared weights becomes large enough to run into floating-point precision problems.
Given what you describe, this might actually be what is happening.
You can quickly check that by doing the accumulation in double precision:

# Accumulate the manual L2 penalty in double and in single precision
wd_double = 0.
wd_single = 0.
for param in net.parameters():
    wd_double += weight_decay * 0.5 * torch.pow(param.double(), 2).sum()
    wd_single += weight_decay * 0.5 * torch.pow(param, 2).sum()
print("Double precision: ", wd_double)
print("Single precision: ", wd_single)
print("Difference: ", wd_double - wd_single)

I would actually be interested to see what your net is and what this returns.