What happens when loss are negative?

Hello Brandon!

I would not recommend doing things this way (or thinking about it
this way). The best approach – because network training relies
so heavily on gradient-based optimization – is to make your loss
function differentiable.

Consider the softmax function (which should more properly be
called the “soft-argmax” function): It can be understood as a
differentiable approximation to the argmax function (which is
not differentiable because it jumps around discretely).

(In a similar vein, the sigmoid function can be considered a
differentiable version of the step function.)

So you should look at your loss function, and try to find a sensible
differentiable replacement for it that has more or less the same
structure.

You could view what you call the “fake gradient” as, in effect,
defining your differentiable loss function, but I wouldn’t think
about it this way. Both in your code and in your mental picture
you should have an explicitly differentiable loss function, and
calculate its real gradient.

(I wouldn’t change or repost this post, but when you switch topics
like this it would be better for the forum if you would start a new
thread in the future.)

Best.

K. Frank

3 Likes