Use max operation in loss function

netaglazer · January 23, 2020, 9:53pm

Hi, I'm very confused.
From all I know, mathematically, the max operation is not differentiable.
I have read a lot of conflicting answers here:
some people say that it is not differentiable, and therefore can't be used in a loss function,
and some say that it is differentiable “enough” for backpropagation.

Does someone have an answer and a good explanation for this issue?

KFrank · January 23, 2020, 11:37pm

Hello Neta!

Consider this one-dimensional (single-variable) function that
uses max:

f (x) = max (x, 0)

This function is differentiable for all values of x except when
x = 0. It is not differentiable exactly at x = 0, but the function
isn’t crazy. You could choose to define its derivative (not
mathematically correctly, though) to be 0 or 1 or 1/2 when x = 0,
and for practical purposes, for example, for back-propagation,
that would do a perfectly reasonable job. Most of the time you won’t be
back-propagating exactly through x = 0, and even if you do,
you probably won’t do so again on the next iteration.
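
As a quick check of the point above, here is a minimal sketch that asks autograd what it reports for max (x, 0) on either side of the kink and exactly at x = 0. It assumes torch.relu as the implementation of max (x, 0); other ways of writing the same function (for example a clamp) may pick a different value at x = 0, so treat the middle entry as version- and op-dependent rather than guaranteed.

```python
import torch

# Three probe points: left of the kink, exactly on it, and right of it.
x = torch.tensor([-1.0, 0.0, 2.0], requires_grad=True)

# f(x) = max(x, 0), written here as ReLU (an assumption of this sketch).
torch.relu(x).sum().backward()

relu_grad = x.grad.clone()
print(relu_grad)  # the middle entry is whatever autograd chose at x = 0
```

The outer entries are the ordinary derivatives (0 and 1); the only question is the middle one, and whatever it is, it is a finite number between 0 and 1.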

Similarly, consider this function of two variables:

f (x, y) = max (x, y)

This function is differentiable (that is, the gradients, the partial
derivatives with respect to x and y, are well defined) for all
values of x and y except along the line where x = y.

The same reasoning applies here. You usually won’t try to
back-propagate through a point where x = y, and even if
you do, using 0 or 1 or whatever for the partial derivative in
question will be good enough.
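
The two-variable case can be probed the same way. This is only a sketch: it assumes torch.maximum (the elementwise two-argument max) and deliberately evaluates it on the line x = y. How the gradient is divided between x and y at a tie is an implementation detail, so the code prints the values instead of assuming them.

```python
import torch

# Evaluate f(x, y) = max(x, y) exactly on the line x = y.
x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(3.0, requires_grad=True)

torch.maximum(x, y).backward()

tie_grads = (x.grad.item(), y.grad.item())
print(tie_grads)  # how the unit gradient was shared between x and y
```

Whichever convention is used, the two partial derivatives are finite and together account for the whole incoming gradient, which is all back-propagation needs.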

In practice we know – and lots of experience proves – that this
works.

(Now if autograd returned NaN or 10,000,000 or something
when you hit one of the rare points where max is not technically
differentiable, your training would likely break. But autograd
uses some reasonable value like 0 or 1 or 1/2, and everything
works fine.)

Good luck!

K. Frank

Minqi_Jiang · December 16, 2021, 1:17am

Actually, max (x, 0) is just the ReLU activation function, which is differentiable everywhere except at x = 0.

ado_sar · June 29, 2024, 8:36pm

Does it use 1/2?

With the following snippet:

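import torch

# The maximum value 2. appears twice, so there is a tie.
x = torch.tensor([2., 1., 2., 0.], requires_grad=True)
out = torch.max(x)  # full reduction over all elements
out.backward()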

I get:

x.grad
tensor([0.5000, 0.0000, 0.5000, 0.0000])
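
For comparison, here is a hedged sketch of the same tensor reduced with the dim= form of torch.max, which also returns the index of the selected element. As far as I know, this form sends the whole gradient to that single index instead of splitting it between the tied entries; which of the tied indices gets picked is not something I would rely on.

```python
import torch

x = torch.tensor([2., 1., 2., 0.], requires_grad=True)

# The dim= form returns (values, indices) instead of a single scalar tensor.
values, indices = torch.max(x, dim=0)
values.backward()

dim_grad = x.grad.clone()
print(indices.item(), dim_grad)  # one of the tied positions receives the gradient
```

So whether 1/2 shows up seems to depend on which max variant is used: the full reduction above shares the gradient among ties, while the indexed form picks one winner.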