I am confused by all the fragments of tutorials scattered across different websites. How is SGD implemented in PyTorch?
lr: learning rate
w: weights
dw: the gradient of the loss with respect to the weights
Is this equation right?
w' = momentum * w - lr * (dw + weight_decay * w)
Or is it this one?
v = momentum * v(t-1) + (dw + weight_decay * w)
w = w - lr * v
Here v(t-1) means the value of v from the previous step.
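To make the second candidate concrete, here is a minimal sketch of it in plain Python. It is not PyTorch's actual code: the function name `sgd_momentum_step`, the toy loss, and the choice to start v at zero are all assumptions made for illustration.

```python
def sgd_momentum_step(w, dw, v, lr, momentum=0.0, weight_decay=0.0):
    """One step of the second candidate update, element-wise on lists of floats.

    Hypothetical helper for illustration only; not a PyTorch function.
    """
    new_w, new_v = [], []
    for w_i, dw_i, v_i in zip(w, dw, v):
        g_i = dw_i + weight_decay * w_i   # gradient with weight decay folded in
        v_i = momentum * v_i + g_i        # v = momentum * v(t-1) + (dw + weight_decay * w)
        new_v.append(v_i)
        new_w.append(w_i - lr * v_i)      # w = w - lr * v
    return new_w, new_v


# Toy problem (assumed): minimise 0.5 * w^2, whose gradient is dw = w.
w, v = [1.0], [0.0]  # assumption: v starts at zero
for _ in range(3):
    w, v = sgd_momentum_step(w, dw=list(w), v=v, lr=0.1, momentum=0.9)
```

Note that in this form momentum multiplies the running buffer v, not the weights themselves, which is where it differs from the first candidate equation.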