I have been in contact with the author of the paper, and he thinks it might be due to differences from the original ResNet implementation, such as applying weight decay to the biases.

I have gone through the PyTorch implementation of weight decay and now have some doubts I would like to ask:

Is weight decay applied to the batchnorm biases by default?

Looking at the current implementation in sgd.py, could anyone explain what line 85 is doing?

For 1.: yes, it looks like weight decay is applied to the biases as well as the weights by default.
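To see this concretely, here is a minimal sketch (the toy model below is hypothetical, not from the thread): with the default `model.parameters()` call, every parameter lands in a single parameter group, so the optimizer's `weight_decay` hits the BatchNorm weight and bias just like the Linear weight.

```python
import torch
import torch.nn as nn

# hypothetical toy model with a BatchNorm layer
model = nn.Sequential(nn.Linear(4, 4), nn.BatchNorm1d(4))

# model.parameters() puts everything into one parameter group, so the
# optimizer's weight_decay applies to all four tensors below,
# including the BatchNorm bias
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
for name, p in model.named_parameters():
    print(name, tuple(p.shape))
# 0.weight (4, 4)
# 0.bias (4,)
# 1.weight (4,)
# 1.bias (4,)
```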

For 2.: it multiplies p.data by weight_decay and adds the result to d_p (the gradient).

Example:

import numpy as np
import torch
import torch.nn as nn

# 3x3 matrix of ones
xx = nn.Parameter(torch.from_numpy(np.ones((3, 3))))
print(xx)
# Parameter containing:
# 1 1 1
# 1 1 1
# 1 1 1
# [torch.DoubleTensor of size 3x3]

# sum of squares so we can backprop a gradient
y = torch.sum(xx ** 2)
y.backward()

# gradient of size 3x3
d_p = xx.grad.data
print(d_p)
# 2 2 2
# 2 2 2
# 2 2 2
# [torch.DoubleTensor of size 3x3]

# the operation from sgd.py: d_p += weight_decay * p.data
# (the keyword form below is equivalent to the older positional
#  d_p.add_(weight_decay, xx.data) used in sgd.py)
weight_decay = 3.0
d_p.add_(xx.data, alpha=weight_decay)
print(d_p)
# 5 5 5
# 5 5 5
# 5 5 5
# [torch.DoubleTensor of size 3x3]

The parameter matrix was all ones and its gradient was all twos; multiplying the data by the weight decay (3.0) and adding it to the gradient gives fives.

You can see a further explanation on Stack Exchange: you are essentially doing the update w_i ← w_i − η ∂E/∂w_i − η λ w_i. The weight decay is that final −η λ w_i term.
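As a sanity check, this sketch verifies that a plain SGD step with weight_decay (no momentum) matches the manual update w ← w − η (∂E/∂w + λ w), reusing the numbers from the example above:

```python
import torch

# same setup as before: 3x3 ones, E = sum(w**2), so dE/dw = 2
w = torch.nn.Parameter(torch.ones(3, 3, dtype=torch.float64))
lr, wd = 0.1, 3.0

opt = torch.optim.SGD([w], lr=lr, weight_decay=wd)
y = (w ** 2).sum()
y.backward()

# manual update, computed before the optimizer step:
# w - lr * (grad + wd * w) = 1 - 0.1 * (2 + 3) = 0.5
expected = w.data - lr * (w.grad + wd * w.data)

opt.step()
print(torch.allclose(w.data, expected))  # True
```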

If you want to filter out weight decay only for biases (i.e. have weight decay for the weights, but none for the biases), you can use the per-parameter optimization options, as described here: http://pytorch.org/docs/optim.html#per-parameter-options