Gradient of Threshold function

The Threshold activation function doesn’t seem to be differentiable (the gradient should be 0 everywhere except at the jump, where it’s non-differentiable).

How is backpropagation handled with this type of activation function? Is it ignored?

Hi Lebourdias!

Pytorch’s Threshold is (usefully) differentiable. When its input, x, is
greater than the threshold, its output is x and the gradient is 1.0; when
x is less than or equal to the threshold, the output is the constant
replacement value and the gradient is 0.0.
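
Here is a minimal sketch (the threshold and replacement value are just
illustrative numbers):

import torch

x = torch.tensor([-1.0, 0.5, 2.0], requires_grad=True)
thresh = torch.nn.Threshold(1.0, 0.0)  # threshold = 1.0, replacement value = 0.0
y = thresh(x)                          # tensor([0., 0., 2.])
y.sum().backward()
print(x.grad)                          # tensor([0., 0., 1.]) -- 1.0 where x > threshold, 0.0 elsewhere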

What you describe is a step function (the Heaviside function),
rather than pytorch’s Threshold.

For a step function (whose gradient you describe), you can use
pytorch’s heaviside() function. The output of heaviside() carries
grad_fn = <NotImplemented> (if either of its inputs carries
requires_grad = True). This is pytorch’s way of telling you that
this function isn’t usefully differentiable.
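
For example (a minimal sketch; depending on your pytorch version, calling
heaviside() on a gradient-tracking input either produces an output with
grad_fn = <NotImplemented> or raises an error outright):

import torch

x = torch.tensor([-1.0, 0.0, 2.0], requires_grad=True)
try:
    y = torch.heaviside(x, torch.tensor(0.5))  # 0.5 is the value used where x == 0
    print(y)             # tensor([0.0000, 0.5000, 1.0000], grad_fn=<NotImplemented>)
    y.sum().backward()   # raises on versions where no backward is implemented for heaviside()
except RuntimeError as err:
    print(err)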

You can also code a step function “by hand”:

y = (x > 0.0).float()

In this case, the result will not carry requires_grad = True, even if
the input does.
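
A quick check:

import torch

x = torch.tensor([-1.0, 0.0, 2.0], requires_grad=True)
y = (x > 0.0).float()    # comparison gives a bool tensor; .float() casts it
print(y)                 # tensor([0., 0., 1.])
print(y.requires_grad)   # False -- no gradient flows back to x through y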

Your main point, that the step function isn’t usefully differentiable,
is quite correct. Even though the gradient of zero (almost everywhere)
is perfectly sensible and mathematically correct, it’s not useful for
gradient-descent-based optimization because a zero gradient doesn’t
tell you by how much nor in what direction you should adjust your model
weights in order to reduce your loss criterion.

As to why pytorch chooses to have the gradient of heaviside() be
NotImplemented (rather than having its output carry requires_grad = False),
I don’t know.

(Depending on your use case, you can use sigmoid() as a “soft” step
function that is usefully differentiable.)
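
For example, a sketch with an illustrative sharpness factor alpha (just a
knob for this example, not anything built in):

import torch

alpha = 10.0                  # larger alpha gives a sharper, more step-like sigmoid
x = torch.tensor([-1.0, 0.0, 2.0], requires_grad=True)
y = torch.sigmoid(alpha * x)  # tensor([4.5398e-05, 5.0000e-01, 1.0000e+00], ...)
y.sum().backward()
print(x.grad)                 # nonzero (though small far from the jump), so gradient descent can make progress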

Best.

K. Frank
