Why do we do subtraction in optimizer?

Why do we do

w -= lr * w.grad

and not something like

w *= lr * w.grad

or

w /= lr * w.grad

Does only subtraction lead to the best weights?

You can check the Wikipedia article on gradient descent, for example.
The subtraction is interpreted as moving through the space of weights along the descent direction, which is given by the negative of the gradient.
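As a tiny sanity check (plain Python, no PyTorch; the loss, starting point, and learning rate are made up for illustration), repeatedly subtracting `lr * grad` walks `w` toward the minimizer:

```python
# Minimize f(w) = (w - 3)**2 by hand; its gradient is 2 * (w - 3).
w = 0.0
lr = 0.1
for _ in range(100):
    grad = 2 * (w - 3)
    w -= lr * grad  # step along the negative gradient
print(w)  # converges to 3.0, the minimizer of f
```

Because the gradient points in the direction of steepest *increase* of the loss, subtracting it is what makes the loss go down; adding it (`w += lr * grad`) would climb the loss instead.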

Division can have a fun interpretation, though: a multiplicative update like w /= exp(lr * w.grad) is also doing a gradient step, just in a different parametrization. If you write the weights as exponentials, w = exp(v), that update on v is exactly the usual gradient descent update with -.

Multiplication would be like w += lr * w.grad in that exponentiated space, which sounds bad :smiley:
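A quick numerical sketch of that correspondence (plain Python; the values of `w`, `lr`, and the gradient `g` are arbitrary, chosen only for illustration). Writing w = exp(v), the multiplicative update w *= exp(-lr * g) is the additive update v -= lr * g:

```python
import math

w = 2.0
lr = 0.1
g = 0.5  # some gradient value (hypothetical)
v = math.log(w)  # w = exp(v)

w_new = w * math.exp(-lr * g)  # multiplicative step on w
v_new = v - lr * g             # ordinary gradient step on v = log(w)

print(math.log(w_new), v_new)  # the two agree exactly
```

This kind of multiplicative step (known as exponentiated gradient descent) also keeps w positive, which is sometimes exactly what you want; but it is a different algorithm from the plain `w -= lr * w.grad` update.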
