No, this won’t work. The problem is that this version of myloss() isn’t usefully differentiable. It is constant almost everywhere, so the gradient
will always be zero.
Mathematically, myloss() is differentiable (with zero gradient) except
where any data[i] = 0.5, at which points myloss() jumps
discontinuously and the derivative is not defined.
Numerically with pytorch you will always get zero gradient, even when
some data[i] = 0.5, because whichever branch of the conditional
you go through, a constant function (constant for that branch) is being
computed by myloss(). Backpropagation will “work” in the sense that
calling loss.backward() will give you a well-defined gradient, but it
doesn’t actually do you any good because the gradient is always zero.
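To make this concrete, here is a sketch of a piecewise-constant loss of the kind described above (your actual myloss() isn’t shown, so this stand-in is an assumption). Note the 0.0 * data trick in each branch: it keeps data in the autograd graph so backward() runs without error, while each branch remains constant in value:

```python
import torch

def myloss(data):
    # each branch is a constant; 0.0 * data keeps data in the graph
    return torch.where(data < 0.5, 1.0 + 0.0 * data, 0.0 * data).sum()

data = torch.tensor([0.3, 0.7, 0.5001], requires_grad=True)
loss = myloss(data)
loss.backward()    # runs fine -- the gradient is well defined ...
print(data.grad)   # ... but it is zero everywhere
```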
In practical terms, let’s say that data = 0.5001, and you get some
value of the loss function. Let’s also say that at data = 0.5 the
loss function jumps to a lower, more favorable value so that you would
like your optimizer step to update data to 0.4999. The problem is
that the optimizer only knows about the gradient, which is zero, and
doesn’t know that very nearby, at 0.4999, you get a lower loss. With zero
gradient the optimizer doesn’t (and can’t) know in which direction to
vary data, that is, whether to increase it, decrease it, or leave it
unchanged, to get to a lower loss.
This is how gradient-descent optimization methods (which are the core
of pytorch’s backpropagation) work, and it’s an inherent limitation they
have.