Is a parameter trainable in the following case?

Hi Guo!

Your reasoning is correct. The thresholding operation is not (usefully)
differentiable, so autograd does not backpropagate through it.

To verify:

>>> import torch
>>> torch.__version__
'1.10.2'
>>> A = torch.tensor([3., 4., 5., 6.], requires_grad=True)
>>> THRES = torch.tensor([4.], requires_grad=True)
>>> A > THRES
tensor([False, False,  True,  True])
>>> A[A > THRES].sum().backward()
>>> A.grad
tensor([0., 0., 1., 1.])
>>> THRES.grad is None
True

The problem is that when A < THRES, the derivative (with respect
to both A and THRES) is zero, and when A > THRES, the derivative
is also zero (and when A == THRES, the derivative is infinite or, if you
prefer, undefined). So even if autograd were to track the derivative,
you wouldn’t be able to do anything useful with it.
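
Written out informally for a single element a and threshold t, treating the
comparison as the step function s(a, t) = 1 if a > t and 0 otherwise, this
is just

\[
\frac{\partial s}{\partial a} \;=\; \frac{\partial s}{\partial t} \;=\; 0
\qquad \text{for } a \neq t ,
\]

with the derivative undefined at a = t.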

As it stands, THRES is not trainable (with a gradient-descent optimizer).

But it is perfectly reasonable to use differentiable “soft” thresholding:

>>> A.grad = None
>>> alpha = 2.5   # sharpness of "soft" step function
>>> (alpha * (A - THRES)).sigmoid()
tensor([0.0759, 0.5000, 0.9241, 0.9933], grad_fn=<SigmoidBackward0>)
>>> (A * (alpha * (A - THRES)).sigmoid()).sum().backward()
>>> A.grad
tensor([0.6016, 3.0000, 1.8004, 1.0930])
>>> THRES.grad
tensor([-4.0018])

With soft thresholding, THRES has a perfectly well-defined (and useful)
gradient, and you could use that gradient to train / optimize THRES.
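
For example, here is a minimal sketch of doing exactly that with plain SGD.
(The data, target value, learning rate, and number of steps below are made up
purely for illustration; only the soft-thresholding expression itself comes
from the example above.)

import torch

A = torch.tensor([3., 4., 5., 6.])
THRES = torch.tensor([4.], requires_grad=True)
alpha = 2.5                              # sharpness of the "soft" step function
target = torch.tensor(11.)               # made-up target for the soft-thresholded sum
opt = torch.optim.SGD([THRES], lr=0.01)

for _ in range(100):
    opt.zero_grad()
    soft_sum = (A * (alpha * (A - THRES)).sigmoid()).sum()
    loss = (soft_sum - target)**2        # made-up loss, just to have something to minimize
    loss.backward()
    opt.step()

print(THRES)                             # THRES has moved away from its initial value of 4.0

Because the loss depends on THRES only through the sigmoid, it is differentiable
in THRES, and the optimizer can nudge the threshold up or down.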

(Note that, in the soft-thresholding example above, A.grad comes from two
terms: one from the first A in the product A * (alpha * (A - THRES)).sigmoid(),
and one from the second A, inside the (A - THRES).sigmoid() factor.)
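
If you want to see those two terms explicitly, here is a quick sketch, reusing
the tensors from above, that recomputes the gradients by hand with the product
rule and the sigmoid derivative s * (1 - s):

>>> s = (alpha * (A - THRES)).sigmoid()
>>> (s + A * alpha * s * (1 - s)).detach()        # the two terms of A.grad
tensor([0.6016, 3.0000, 1.8004, 1.0930])
>>> (-(A * alpha * s * (1 - s)).sum()).detach()   # and the value of THRES.grad
tensor(-4.0018)

These reproduce the A.grad and THRES.grad values printed above.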

Best.

K. Frank
