Based on my understanding of backprop and gradient descent,
the loss is multiplied by the gradient when taking a step with gradient descent.
So when the gradient becomes negative, gradient descent takes a step in the opposite direction.

This idea is well captured when implementing gradient ascent,
as it can be implemented simply by multiplying the loss by -1.

Then, what happens if my loss starts out positive and goes below zero?
If what I said above is correct, training will minimize when the loss is positive and maximize when the loss is negative.
And this means that gradient descent is driving the loss toward zero instead of toward its minimum value…

Loss functions are chosen so that you minimize toward 0. MSE and L1 can’t go below zero; the deviation from zero error is what we are trying to minimize. Have a look at the loss functions available in PyTorch. In all cases, when the prediction equals the target, the loss reduces to zero. You are either maximizing a negative term or minimizing a positive term, but it is always going to zero.

This isn’t true. All common optimization algorithms I’m aware
of – and in particular, gradient descent – only care about the
gradient of the loss, and not the loss itself.

Plain-vanilla gradient descent takes the following optimization
step:

new weights = old weights - learning rate * gradient

Could you have misread “learning rate” as “loss” at some point?

Gradient descent (and, again, all common algorithms that I am
aware of) seeks to minimize the loss, and doesn’t care whether that
minimum value is a large positive value, a value close to zero,
exactly zero, or a large negative value. It simply seeks to drive
the loss to a smaller (that is, algebraically more negative) value.
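As a minimal sketch of this point, the snippet below runs plain gradient descent on a hypothetical one-parameter loss (not from the thread) whose minimum value is -10. The update uses only the gradient, so the loss starts positive, passes through zero, and keeps going down to its minimum. The function names, learning rate, and step count are illustrative assumptions.

```python
def loss(w):
    # Hypothetical loss: its minimum value is -10, reached at w = 3.
    return (w - 3.0) ** 2 - 10.0


def grad(w):
    # Derivative of the loss above with respect to w.
    return 2.0 * (w - 3.0)


def gradient_descent(w, learning_rate=0.1, steps=100):
    history = []
    for _ in range(steps):
        history.append(loss(w))
        # new weights = old weights - learning rate * gradient
        # The loss value itself never enters the update.
        w = w - learning_rate * grad(w)
    return w, history


w_final, history = gradient_descent(w=10.0)
print(history[0])      # loss at the start (positive)
print(loss(w_final))   # loss at the end (close to the minimum of -10)
```

The loss does not stall at zero; it settles near -10 because that is where the gradient vanishes.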

You could replace your loss with

modified loss = conventional loss - 2 * Pi

and you should get the exact same training results and model
performance (except that all values of your loss will be shifted
down by 2 * Pi).
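A quick way to check this claim is to compare gradients numerically. The sketch below assumes a hypothetical stand-in for the conventional loss (a squared error against a target of 2) and uses central finite differences, so nothing about the constant shift is baked into the gradient code.

```python
import math


def conventional_loss(w):
    # Hypothetical stand-in for a conventional loss.
    return (w - 2.0) ** 2


def modified_loss(w):
    # modified loss = conventional loss - 2 * Pi
    return conventional_loss(w) - 2.0 * math.pi


def numerical_grad(f, w, h=1e-5):
    # Central finite-difference estimate of df/dw.
    return (f(w + h) - f(w - h)) / (2.0 * h)


points = [-1.0, 0.5, 2.0, 4.0]
grads_conventional = [numerical_grad(conventional_loss, w) for w in points]
grads_modified = [numerical_grad(modified_loss, w) for w in points]

for w, g_c, g_m in zip(points, grads_conventional, grads_modified):
    print(w, g_c, g_m)
```

The two gradient columns agree up to floating-point noise, so every gradient descent step, and therefore the trained weights, would be the same for both losses.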

It is the case that we often use loss functions that become equal
to zero when the fit of the model to the training data is perfect,
but the optimization algorithms don’t care about this, and they
drive the loss function to algebraically more negative values,
and not towards zero.

Oh, I thought the loss affects gradient descent directly.
So it simply provides the search space, and the size of each training step depends on the learning rate and the gradient, not on the loss value.

Flipping the sign of the loss turns the problem into maximization because everything within the search space is flipped (the min point becomes the max point).

If I understand what you are saying, yes that is correct.

But, just to be sure, you could try the following exercise:

Set up a simple neural network and train it using gradient
descent on some well-behaved data set. Make a plot of
your training loss (and, while you’re at it, your training
accuracy) as a function of batch number or epoch number,
and make sure that it is training stably and that you’re getting
reasonable results.

Use a standard loss function when you do this. Let’s call this
loss-original.

Let’s say that your loss runs from 1.0 down to 0.1 when you
train.

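Now define two modified loss functions: loss-shifted, which is
loss-original minus a constant (large enough that the loss
becomes negative), and loss-negative, which is loss-original
with its sign flipped. Train your neural network again using
these two modified loss functions and make your loss and
accuracy plot for each of these two modified training runs. See
if you get the results you expect, and, if not, post what you got
and ask any questions that you might have.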
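A framework-free stand-in for this exercise might look like the sketch below. It is not the neural-network experiment itself: it assumes a two-parameter linear model, a hypothetical noiseless data set, mean squared error as the original loss, a shift constant of 1.0, and finite-difference gradients in place of backprop. The three loss functions correspond to the original loss, a shifted copy, and a sign-flipped copy.

```python
def mse(params, data):
    """Mean squared error of the line y = w * x + b on (x, y) pairs."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


# Hypothetical noiseless data drawn from y = 2x + 1.
DATA = [(k / 4.0, 2.0 * (k / 4.0) + 1.0) for k in range(-8, 9)]
SHIFT = 1.0  # assumed constant, chosen so the shifted loss ends below zero


def loss_original(params):
    return mse(params, DATA)


def loss_shifted(params):
    return mse(params, DATA) - SHIFT


def loss_negative(params):
    return -mse(params, DATA)


def numerical_grad(f, params, h=1e-5):
    # Central finite differences, one parameter at a time.
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += h
        down[i] -= h
        grads.append((f(up) - f(down)) / (2.0 * h))
    return grads


def train(loss_fn, steps=60, learning_rate=0.05):
    params = [0.0, 0.0]
    history = []
    for _ in range(steps):
        history.append(loss_fn(params))  # loss before this step
        grads = numerical_grad(loss_fn, params)
        params = [p - learning_rate * g for p, g in zip(params, grads)]
    return params, history


params_original, history_original = train(loss_original)
params_shifted, history_shifted = train(loss_shifted)
params_negative, history_negative = train(loss_negative)

print(params_original, history_original[-1])
print(params_shifted, history_shifted[-1])
print(params_negative, mse(params_negative, DATA))
```

Under these assumptions, the original and shifted runs should end at essentially the same parameters, with the shifted loss curve sitting 1.0 lower and crossing zero. The sign-flipped run reports an ever-decreasing loss while the true squared error grows, which is gradient ascent on the original loss.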

Thanks Frank, I did the exercise. It behaves as I expected.
Training is fine and produces exactly the same accuracy for loss-shifted (even though the loss is < 0).
For loss-negative, training fails. The graph says that the loss decreases, but since the sign is flipped, it is conceptually increasing the original loss by applying gradient ascent.

I actually have another question about loss.
From our previous discussion, it is clear that the value of the loss itself does not mean anything.
What actually matters is the gradient with respect to the input and its direction (sign).

I am in a situation where I have to define a new loss function that is not differentiable.
Based on my understanding, it is possible (though not easy) to train a model with a non-differentiable loss function if I define a custom backward function that returns a “fake” gradient with an appropriate direction and magnitude.

I would not recommend doing things this way (or thinking about it
this way). The best approach – because network training relies
so heavily on gradient-based optimization – is to make your loss
function differentiable.

Consider the softmax function (which should more properly be
called the “soft-argmax” function): It can be understood as a
differentiable approximation to the argmax function (which is
not differentiable because it jumps around discretely).

(In a similar vein, the sigmoid function can be considered a
differentiable version of the step function.)
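To make the comparison concrete, here is a small standard-library sketch of both pairs: argmax (as a one-hot vector) next to softmax, and the step function next to the sigmoid. The `temperature` parameter is an illustrative addition, not something mentioned above; it is only there to show that softmax approaches the hard one-hot result as the scores are scaled up.

```python
import math


def argmax_one_hot(scores):
    # Hard, non-differentiable choice: 1 for the largest score, 0 elsewhere.
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [1.0 if i == best else 0.0 for i in range(len(scores))]


def softmax(scores, temperature=1.0):
    # Smooth counterpart: positive weights that sum to 1.
    scaled = [s / temperature for s in scores]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def step(x):
    # Hard threshold at zero.
    return 1.0 if x > 0 else 0.0


def sigmoid(x):
    # Smooth counterpart of the step function.
    return 1.0 / (1.0 + math.exp(-x))


scores = [1.0, 3.0, 2.0]
hard = argmax_one_hot(scores)              # [0.0, 1.0, 0.0]
soft = softmax(scores)                     # largest weight on index 1
sharp = softmax(scores, temperature=0.05)  # very close to the one-hot vector

print(hard, soft, sharp)
print(step(0.5), sigmoid(0.5))
```

A small change in `scores` changes `soft` by a small amount, whereas `hard` either stays the same or jumps, which is why only the smooth versions provide useful gradients.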

So you should look at your loss function, and try to find a sensible
differentiable replacement for it that has more or less the same
structure.

You could view what you call the “fake gradient” as, in effect,
defining your differentiable loss function, but I wouldn’t think
about it this way. Both in your code and in your mental picture
you should have an explicitly differentiable loss function, and
calculate its real gradient.
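One illustration of why a differentiable replacement helps, sketched under assumptions of my own choosing: a hypothetical one-parameter classifier, a tiny made-up data set, the 0-1 misclassification rate as the non-differentiable loss, and the logistic (cross-entropy) loss as its smooth replacement. The 0-1 loss is piecewise constant, so its gradient is zero almost everywhere and gives the optimizer nothing to follow, while the smooth loss has a real, informative gradient.

```python
import math

# Hypothetical toy data: (feature, label) pairs with labels in {0, 1}.
DATA = [(-2.0, 0), (-1.0, 0), (0.5, 1), (1.5, 1), (-0.5, 1)]


def zero_one_loss(w):
    # Fraction of points misclassified when thresholding w * x at zero.
    # Piecewise constant in w, hence not usefully differentiable.
    errors = sum(1 for x, y in DATA if (1 if w * x > 0 else 0) != y)
    return errors / len(DATA)


def logistic_loss(w):
    # Smooth replacement: cross-entropy of sigmoid(w * x) against the label.
    total = 0.0
    for x, y in DATA:
        p = 1.0 / (1.0 + math.exp(-w * x))
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(DATA)


def numerical_grad(f, w, h=1e-5):
    # Central finite-difference estimate of df/dw.
    return (f(w + h) - f(w - h)) / (2.0 * h)


w = 0.3
grad_zero_one = numerical_grad(zero_one_loss, w)  # flat: no training signal
grad_logistic = numerical_grad(logistic_loss, w)  # nonzero: points downhill

print(grad_zero_one, grad_logistic)
```

Here the smooth loss has the same overall structure as the 0-1 loss (it penalizes points on the wrong side of the threshold), which is the sense in which it is a sensible replacement rather than an arbitrary one.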

(I wouldn’t change or repost this post, but when you switch topics
like this it would be better for the forum if you would start a new
thread in the future.)

Okay, I will see if I can create a differentiable variation of the loss function I am looking for.

Let me rephrase your passage about softmax, just to confirm that I understood correctly.
Softmax is essentially a continuous version of argmax: it takes the same input and returns very similar outputs (close to 1 for the greatest element, close to 0 otherwise).
And while argmax is not differentiable, softmax is, so it can nicely replace argmax while enabling backprop through the framework.
I think this is pretty brilliant and I feel like I kind of know what I need to do. Thanks