Hello smani -

First, could you please post text rather than screenshots? (It makes things searchable, lets screen readers work, etc.)

If you need to post an equation that you can’t format adequately using the forum’s markdown, you could post a screenshot of that, but do then refer to and describe the equation in your text.

I haven’t verified your analytic computation of the gradient; I’ll take your word for it.

But I think the issue is the following:

You say “In my understanding when the loss is zero there should not be any updates.” In general, this isn’t correct. It’s true that we often work with loss functions that are non-negative (never less than zero), and often “loss = 0” implies “true solution found.” But the value of the loss function has no particular meaning in the pytorch framework. It is just something you minimize to train your network. Nothing prevents your loss function from having a minimum of less than zero, and nothing prevents pytorch from finding that minimum.
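
To illustrate (this is a made-up one-parameter “network” and loss, not your code): here is a loss whose minimum is -5.0, and plain SGD happily drives the loss below zero on its way there:

```python
import torch

# A toy "network": a single scalar parameter.
x = torch.tensor([0.0], requires_grad=True)
opt = torch.optim.SGD([x], lr=0.1)

# A loss whose minimum is -5.0 (at x = 2.0) -- nothing stops
# gradient descent from finding it.
for _ in range(200):
    opt.zero_grad()
    loss = (x - 2.0)**2 - 5.0
    loss.backward()
    opt.step()

print(x.item(), loss.item())  # x approaches 2.0, loss approaches -5.0
```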

In gradient descent, the (negative of the) gradient tells you the direction to move in parameter space to make your loss smaller. (That is, algebraically smaller – less positive, which is the same as more negative.)

Your example illustrates this general point. Your loss function has the constant lambda in it. But the locations of any minima of your loss function don’t depend on lambda, the gradient is independent of lambda, and the optimization algorithm (presumably gradient descent) doesn’t care about lambda.

If your loss function happens to be zero for some particular value of lambda, training won’t stop, nor should it – the fact that your loss was zero was merely an artifact of that particular value of lambda.
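
You can check this directly in pytorch with a stand-in loss (the function and names below are my invention, not your actual code) – an additive constant drops out when you differentiate, so the gradient is identical for every lambda, including the lambda that makes the loss exactly zero:

```python
import torch

# Hypothetical stand-in for your loss: a base loss plus an
# additive constant lambda.
def loss_fn(w, lam):
    return (w - 3.0)**2 + lam

w = torch.tensor([1.0], requires_grad=True)

# The gradient is the same for every value of lambda -- the
# additive constant vanishes under differentiation.
for lam in (0.0, -4.0, 7.5):
    if w.grad is not None:
        w.grad.zero_()
    loss = loss_fn(w, lam)
    loss.backward()
    print(f"lambda = {lam}: loss = {loss.item()}, grad = {w.grad.item()}")

# In particular, with lam = -4.0 the loss is exactly zero at
# w = 1.0, yet the gradient is -4.0, so the optimizer would keep
# updating w just the same.
```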

Good luck.

K. Frank