Why do we do subtraction in the optimizer?

Why do we do

w -= lr * w.grad

and not something like

w *= lr*w.grad

or

w /= lr*w.grad

Does only subtraction lead to the best weights?

You can check the Wikipedia article on gradient descent, for example.
The subtraction is interpreted as moving in the space of weights along the descent direction given by the negative of the gradient.
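For a concrete picture, here is a minimal sketch of one manual step on a toy quadratic loss (assuming PyTorch; the weight, learning rate, and loss here are made up purely for illustration):

```python
import torch

w = torch.tensor([2.0], requires_grad=True)  # toy weight
lr = 0.1

loss = (w ** 2).sum()   # toy loss with its minimum at w = 0
loss.backward()         # fills w.grad with dloss/dw = 2w

with torch.no_grad():   # the update itself should not be tracked by autograd
    w -= lr * w.grad    # step along the negative gradient direction
w.grad.zero_()          # clear the gradient before the next step

print(w)  # 2.0 - 0.1 * 4.0 = 1.6, i.e. closer to the minimum
```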

w /= lr*w.grad can have a fun interpretation: it is also a gradient-style step, just in log space (the weights being the exponentiated values). Take logarithms and the division becomes the familiar update with -, since log(w / x) = log(w) - log(x).

Multiplication would be like w += lr * w.grad in that log space, which sounds bad, since you would be moving up the gradient instead of down :smiley:
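A quick numeric sketch of that log-space reading (plain Python; the weight and step values are made up, just to illustrate the identities):

```python
import math

w, step = 2.0, 1.25   # stand-ins for the weight and for lr * w.grad

# division update: log(w / step) == log(w) - log(step), i.e. subtraction in log space
print(math.isclose(math.log(w / step), math.log(w) - math.log(step)))  # True

# multiplication update: log(w * step) == log(w) + log(step), i.e. addition in log space
print(math.isclose(math.log(w * step), math.log(w) + math.log(step)))  # True
```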
