Weight Decay Implementation

For 1., yes, it looks like weight decay is applied to the biases as well as the weights, since the optimizer just loops over every parameter in the group.
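
If you want decay on the weights only, a common workaround is to split the parameters into groups, each with its own weight_decay. This is a minimal sketch (the Linear stand-in model and the name-based bias check are my own assumptions, not from the SGD source):

import torch

model = torch.nn.Linear(10, 10)  # stand-in model, for illustration only

decay, no_decay = [], []
for name, param in model.named_parameters():
    # send biases to the group that gets no weight decay
    (no_decay if name.endswith("bias") else decay).append(param)

optimizer = torch.optim.SGD(
    [{"params": decay, "weight_decay": 1e-4},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=0.1,
)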

For 2., it multiplies p.data by weight_decay and then adds the result to d_p (the gradient).

Example:

import numpy as np
import torch
import torch.nn as nn

# 3x3 parameter matrix of ones
xx = nn.Parameter(torch.from_numpy(np.ones((3, 3))))
print(xx)
# Parameter containing:
# tensor([[1., 1., 1.],
#         [1., 1., 1.],
#         [1., 1., 1.]], dtype=torch.float64, requires_grad=True)

# sum of squares, just so we can backprop a gradient
y = torch.sum(xx ** 2)
y.backward()

# gradient of sum(x^2) is 2x, so a 3x3 matrix of twos
d_p = xx.grad.data
print(d_p)
# tensor([[2., 2., 2.],
#         [2., 2., 2.],
#         [2., 2., 2.]], dtype=torch.float64)

# the weight-decay step from the SGD source; older PyTorch wrote
# this as d_p.add_(weight_decay, xx.data), the modern spelling
# uses the alpha keyword
weight_decay = 3.0
d_p.add_(xx.data, alpha=weight_decay)
print(d_p)
# tensor([[5., 5., 5.],
#         [5., 5., 5.],
#         [5., 5., 5.]], dtype=torch.float64)

You can see that the matrix was ones and the gradient matrix was twos; we multiplied the data by the weight decay (3.0) and added it to the gradient to get fives.

See Stack Exchange for a further explanation: you're essentially doing the update w_i ← w_i − η ∂E/∂w_i − η λ w_i, where the weight decay is the final term.
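
To connect the two views, here's a minimal sketch of one plain SGD step with weight decay, assuming no momentum (the lr value is arbitrary, and this is my own illustration rather than the optimizer source):

import torch

w = torch.ones(3, 3, dtype=torch.float64, requires_grad=True)
loss = torch.sum(w ** 2)
loss.backward()

lr, weight_decay = 0.1, 3.0

with torch.no_grad():
    # fold the decay into the gradient, as the SGD source does
    d_p = w.grad + weight_decay * w
    # then step: w <- w - lr*grad - lr*weight_decay*w
    w -= lr * d_p

print(w)
# every entry is 1 - 0.1*(2 + 3*1) = 0.5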
