Weight Decay Implementation

Hi,

I have reproduced ResNeXt with PyTorch on CIFAR, and my results are always slightly below the original Torch implementation.

I have been in contact with the author of the paper, and he thinks it might be due to differences from the original ResNet implementation, such as applying weight decay to the biases.

I have gone through the PyTorch implementation of weight decay, and I now have a couple of questions:

  • Does weight decay affect the batch-norm biases by default?
  • Looking at the current implementation in sgd.py, could anyone explain what line 85 is doing?

d_p.add_(weight_decay, p.data)

Thank you in advance!


For 1: yes, it looks like weight decay is applied to the biases as well as the weights by default.
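
A quick way to check is to build a toy model and inspect the optimizer's default param group (a minimal sketch; the model and hyperparameter values are just placeholders):

import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8))
opt = optim.SGD(model.parameters(), lr=0.1, weight_decay=5e-4)

# a single param group holds every parameter: the conv weight/bias
# and the batch-norm weight/bias alike, all with the same decay
print(len(opt.param_groups))                 # 1
print(len(opt.param_groups[0]['params']))    # 4
print(opt.param_groups[0]['weight_decay'])   # 0.0005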

For 2: it multiplies p.data by weight_decay and then adds the result to d_p (the gradient); in other words, d_p += weight_decay * p.data.

Example:

import numpy as np
import torch
import torch.nn as nn

# 3x3 matrix of ones
xx = nn.Parameter(torch.from_numpy(np.ones((3, 3))))
print(xx)
# 1  1  1
# 1  1  1
# 1  1  1

# sum of squares, just so we can backprop a gradient
y = torch.sum(xx ** 2)
y.backward()

# gradient of size 3x3: d(x^2)/dx = 2x, so all twos
d_p = xx.grad.data
print(d_p)
# 2  2  2
# 2  2  2
# 2  2  2

# the operation from sgd.py, i.e. d_p += weight_decay * xx.data
# (d_p.add_(weight_decay, xx.data) is the legacy positional form;
#  newer PyTorch wants the alpha keyword)
weight_decay = 3.0
d_p.add_(xx.data, alpha=weight_decay)
print(d_p)
# 5  5  5
# 5  5  5
# 5  5  5

So the parameter matrix was all ones and the gradient matrix was all twos; multiplying the data by the weight decay (3.0) and adding it to the gradient gives all fives: 2 + 3·1 = 5.

There is a further explanation on Stack Exchange; you're essentially doing the following update: w_i ← w_i − η ∂E/∂w_i − η λ w_i. The weight decay is that final term.
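
To make the update concrete, here is that step written out by hand for plain SGD with no momentum (a minimal sketch; the model, lr, and weight_decay below are just placeholders):

import torch
import torch.nn as nn

model = nn.Linear(4, 2)
loss = model(torch.randn(8, 4)).pow(2).sum()
loss.backward()

lr, weight_decay = 0.1, 1e-4
with torch.no_grad():
    for p in model.parameters():
        grad = p.grad + weight_decay * p    # dE/dw_i + lambda * w_i
        p -= lr * grad                      # w_i <- w_i - eta * (dE/dw_i + lambda * w_i)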


Thank you, really nice explanation 😄

If you want to filter out weight decay only for the biases (i.e. weight decay for the weights, but none for the biases), you can use the per-parameter optimization options, as described here: http://pytorch.org/docs/optim.html#per-parameter-options

To pick the biases out of the model, you can use model.named_parameters().

Here is a minimal sketch of what that might look like (the toy model and the split on the bias name suffix are illustrative, not the only approach):
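
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 2))

# put every bias in a no-decay group, everything else in a decay group
decay, no_decay = [], []
for name, param in model.named_parameters():
    if name.endswith('bias'):
        no_decay.append(param)
    else:
        decay.append(param)

optimizer = optim.SGD(
    [{'params': decay, 'weight_decay': 1e-4},
     {'params': no_decay, 'weight_decay': 0.0}],
    lr=0.1, momentum=0.9)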


I want the opposite (decay everywhere), so my problem is solved 🙂 Actually, your answer helps me in another project.

Thank you!

Well explained!