How can a custom loss function be backpropagated?

I built my loss function using conditional statements, like this:

def myloss(data):
    if blah blah:
        loss = blah
    if blah blah:
        loss = blah
    return loss
loss = myloss(output)

I was worried it wouldn’t work, but it did.
But how can my loss function be backpropagated?
Is my loss function differentiable?

Thank you in advance for your help.

Hello Hwarang!

In short, the conditional statement doesn’t break anything.

If loss inside of your myloss() function is calculated with pytorch
tensor operations (that have backward() implemented and are
differentiable), backpropagation through myloss() will work just fine.

So, to be concrete, let:

def myloss(data):
    if data[0][0] > 5.0:
        loss = 1.0 * (data**2).sum()
    else:
        loss = 2.0 * (data**3).sum()
    return loss

Mathematically speaking, myloss() will be differentiable everywhere
except at data[0][0] = 5.0, which is good enough.

In practice, if data[0][0] = 5.0, myloss() will take the second
branch and loss.backward() will calculate the gradient that
corresponds to loss = 2.0 * (data**3).sum().
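
To make this concrete, here is a minimal sketch that checks the gradient at
data[0][0] = 5.0. The input values are made up for illustration, and the
else: for the second branch is my assumption about the intended structure.

```python
import torch

def myloss(data):
    # The branch is chosen from the current value of data[0][0];
    # autograd only records the operations of the branch actually taken.
    if data[0][0] > 5.0:
        loss = 1.0 * (data**2).sum()
    else:
        loss = 2.0 * (data**3).sum()
    return loss

# Illustrative input with data[0][0] exactly at the branch point.
data = torch.tensor([[5.0, 1.0], [2.0, 3.0]], requires_grad=True)

loss = myloss(data)
loss.backward()

# 5.0 > 5.0 is False, so the second branch runs and the gradient
# is that of 2.0 * (data**3).sum(), namely 6.0 * data**2.
print(data.grad)
print(torch.allclose(data.grad, 6.0 * data.detach() ** 2))
```

The conditional itself is ordinary Python control flow; only the tensor
operations inside the chosen branch end up in the graph that backward() walks.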


K. Frank

I’m sorry, but I think I asked the wrong question; it was not what I meant to ask.
Actually, my loss function is constructed like:

import math
def myloss(data):
    tmp = 0
    for i in range(len(data)):
        if data[i] > 0.5:
            tmp += 1
        else:
            tmp += 2
    loss = math.log10(tmp)
    return loss
loss = myloss(output)

In this case, does it work too?

Hello Hwarang!

No, this won’t work. The problem is that this version of myloss() isn’t
usefully differentiable. It is constant almost everywhere, so the gradient
will always be zero.

Mathematically, myloss() is differentiable (with zero gradient) except
when any of the data[i] = 0.5, at which values myloss() jumps
discontinuously and the derivative is not defined.

Numerically with pytorch you will always get zero gradient, even when
some data[i] = 0.5, because whatever branch of the conditional
you go through, a constant function (constant for that branch) is being
differentiated.

myloss() and backpropagation will “work” in the sense that calling
loss.backward() will give you a well-defined gradient, but it doesn’t
actually do you any good because the gradient is always zero.

In practical terms, let’s say that data[3] = 0.5001, and you get some
value of the loss function. Let’s also say that at data[3] = 0.5 the
loss function jumps to a lower, more favorable value so that you would
like your optimizer step to update data[3] to 0.4999. The problem is
that the optimizer only knows about the gradient, which is zero, and
doesn’t know that very nearby at 0.4999 you get a lower loss. With zero
gradient the optimizer doesn’t (and can’t) know in which direction to vary
data[3], that is, whether to increase, decrease, or leave unchanged
data[3], to get to a lower loss.
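
As a rough illustration of the zero-gradient problem, here is a sketch under
some assumptions of mine. It uses a tensor-based rewrite of the counting loss
(torch.where and torch.log10 instead of a Python counter and math.log10), since
a plain Python float has no backward(). The "0.0 * data" terms are only a trick
to keep the constants attached to the autograd graph, and the input values and
learning rate are made up.

```python
import torch

def myloss(data):
    # Hypothetical tensor rewrite of the counting loss: each element
    # contributes 1 if it is above 0.5 and 2 otherwise.
    # "0.0 * data" keeps the per-branch constants connected to data,
    # so backward() can run; their derivative w.r.t. data is zero.
    ones = 0.0 * data + 1.0
    twos = 0.0 * data + 2.0
    tmp = torch.where(data > 0.5, ones, twos).sum()
    return torch.log10(tmp)

# data[3] sits just above the threshold, as in the discussion above.
data = torch.tensor([0.2, 0.7, 0.9, 0.5001], requires_grad=True)
optimizer = torch.optim.SGD([data], lr=0.1)

loss = myloss(data)
loss.backward()
print(data.grad)  # every entry is zero

# With a zero gradient, the SGD step leaves data unchanged,
# so data[3] never crosses 0.5.
before = data.detach().clone()
optimizer.step()
print(torch.equal(data.detach(), before))
```

However close data[3] is to 0.5, the step has no information about the jump
on the other side of the threshold.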

This is how gradient-descent optimization methods (which are the core
of pytorch’s backpropagation) work, and it’s an inherent limitation they
have.


K. Frank

Thanks! I understand now, thanks to your kind answer. Thanks again!