I am confused by all the bits and pieces of tutorials scattered across different websites. How is SGD implemented in PyTorch?

lr: learning rate

w: weights

dw: the gradient of the loss with respect to the weights

Is this equation right: `w' = momentum * w - lr * (dw + weight_decay * w)`?

Or is it `v = momentum * v(t-1) + (dw + weight_decay * w)`, followed by `w = w - lr * v`, where `v(t-1)` is the value of `v` from the previous step?
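To make the second form concrete, here is a minimal plain-Python sketch of one update step. It is only an illustration, not PyTorch's actual code: the function name `sgd_momentum_step` and the toy loss are made up for this example, `v` is assumed to start at zero, and the dampening and Nesterov options are ignored. As far as I can tell, this is the form described in the `torch.optim.SGD` documentation when `dampening=0` and `nesterov=False`.

```python
def sgd_momentum_step(w, dw, v, lr, momentum=0.0, weight_decay=0.0):
    """Apply one step of the velocity-based rule elementwise to lists of floats."""
    new_w, new_v = [], []
    for w_i, dw_i, v_i in zip(w, dw, v):
        g_i = dw_i + weight_decay * w_i   # gradient with weight decay folded in
        v_i = momentum * v_i + g_i        # v = momentum * v(t-1) + g
        new_v.append(v_i)
        new_w.append(w_i - lr * v_i)      # w = w - lr * v
    return new_w, new_v


# Toy usage: minimise sum(w_i ** 2), whose gradient is 2 * w_i.
w = [1.0, -2.0]
v = [0.0, 0.0]  # assumed initial velocity
for _ in range(2):
    dw = [2.0 * w_i for w_i in w]
    w, v = sgd_momentum_step(w, dw, v, lr=0.1, momentum=0.9)
print(w)
```

Note that in this form the momentum factor multiplies the running velocity `v`, not the weights `w` as in the first equation.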