# Is there any soft way of counting positive values with grad preserved?

Hi,

I need a differentiable metric that roughly reflects how many positive values a tensor contains, so an exact count is not necessary.

I tried two hard operations:

1. `relu` + `torch.sign` + `sum`
2. `torch.count_nonzero`

Result: both break the gradient (`torch.sign` has zero gradient almost everywhere, and `torch.count_nonzero` is not differentiable at all).

I'm not sure whether fc layer + relu + soft sign + sum would be a workable option.

Any suggestions?

Thank you!

Hi Ximeng!

`sigmoid (x)`, which is differentiable, moves “softly” from `0` to `1` as `x`
moves from negative to positive.

You may shift where the transition occurs, `sigmoid (x - shift)`, and
sharpen the transition, `sigmoid (sharpness * x)`.

Consider:

```
>>> import torch
>>> torch.__version__
'1.12.0'
>>> _ = torch.manual_seed (2022)
>>> x = torch.randn (5, 8, requires_grad = True)
>>> x
tensor([[-0.9788, -1.5154, -0.8222,  0.1214,  0.0716, -0.0872, -0.0253, -1.6267],
        [ 0.2230, -1.6746, -1.4725,  0.9721, -0.2191, -0.9397, -1.7756, -0.6259],
        [-1.1104,  1.1890,  1.3730,  0.4915,  0.3579, -0.1685, -0.8579, -1.0574],
        [ 0.2105,  1.9045,  1.8237,  1.5122, -0.3140, -0.0810, -1.3631, -0.0701],
        [-1.1876, -1.0787,  0.9551, -0.2958,  1.0663, -0.5134, -0.3846, -1.1481]],
       requires_grad=True)
>>> (x > 0).sum()
tensor(14)
>>> torch.sigmoid (x)
tensor([[0.2731, 0.1801, 0.3053, 0.5303, 0.5179, 0.4782, 0.4937, 0.1643],
        [0.5555, 0.1578, 0.1866, 0.7255, 0.4454, 0.2810, 0.1448, 0.3484],
        [0.2478, 0.7666, 0.7979, 0.6204, 0.5885, 0.4580, 0.2978, 0.2578],
        [0.5524, 0.8704, 0.8610, 0.8194, 0.4221, 0.4798, 0.2037, 0.4825],
        [0.2337, 0.2538, 0.7221, 0.4266, 0.7439, 0.3744, 0.4050, 0.2408]],
       grad_fn=<SigmoidBackward0>)
>>> torch.sigmoid (x).sum()
>>> torch.sigmoid (10 * x).sum()
>>> torch.sigmoid (100 * x).sum()
```
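To add to the example above, here is a small sketch (my own, not from the original reply; the seed and the sharpness value 10 are arbitrary choices) checking that a moderately sharpened sigmoid sum both tracks the hard count and still delivers non-zero gradients:

```python
import torch

torch.manual_seed(0)
x = torch.randn(5, 8, requires_grad=True)

# Hard count: how many entries are actually positive.
hard = (x > 0).sum().item()

# Soft count: a sigmoid with moderate sharpness approximates the
# 0/1 indicator function while remaining differentiable.
soft = torch.sigmoid(10 * x).sum()
soft.backward()

print(hard, round(soft.item(), 3))  # soft count approximates hard count
print(x.grad.abs().max().item())    # non-zero: gradient still flows
```

The sharpness of 10 keeps the transition narrow enough for a decent approximation without pushing the per-element gradients into underflow.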

Best.

K. Frank

Thanks Frank. But I'm not sure whether the sigmoid approach might cause a vanishing-gradient problem.

```
d = torch.sigmoid (1000 * c).sum()
d.backward()
c.grad
tensor([0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        1.6938e-22, 0.0000e+00, 0.0000e+00, 0.0000e+00])
```

So I tried another possible solution, which might slightly mitigate the problem:

```
b = torch.relu(a / (1e-3 + torch.abs(a))).sum()
b.backward()
a.grad
tensor([0.0000, 0.0002, 0.0003, 0.0027, 0.0000, 0.0000, 0.0000, 0.0000, 0.0012,
        0.0029])
```
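If it helps, the idea above can be wrapped in a small helper (the name `soft_count_positive` and the default `eps` are my own choices, not from the thread):

```python
import torch

def soft_count_positive(a, eps=1e-3):
    # a / (eps + |a|) is a smooth "soft sign": roughly -1 for clearly
    # negative entries, roughly +1 for clearly positive ones, with a
    # transition of width ~eps around zero.  relu() keeps only the
    # positive side, and sum() turns the result into a soft count.
    return torch.relu(a / (eps + torch.abs(a))).sum()

torch.manual_seed(0)
a = torch.randn(10, requires_grad=True)
count = soft_count_positive(a)
count.backward()
print(count.item(), (a > 0).sum().item())  # soft vs. hard count
```

As the gradients above show, clearly positive entries still receive only a small gradient (about `eps / a**2`), so `eps` trades off count accuracy against gradient size here too.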

BTW, I noticed that gradients are also tiny for a normal softmax operation:

```
b = torch.softmax(a, dim=0).sum()
b.backward()
a.grad
tensor([7.9883e-09, 1.6591e-08, 4.4367e-09, 1.7502e-08, 1.7472e-08, 9.1981e-09,
        2.4514e-09, 4.3570e-08])
```

Obviously, softmax should always work (e.g., in a self-attention block), even though Float32 only has about 7 significant digits, if I remember correctly. So I guess there must be some misunderstanding about the gradient above.

Hi Ximeng!

`1000` is very large for the multiplier used to “sharpen” the `sigmoid()`.
This causes the `sigmoid()` to become quite close to a (discontinuous)
step function for which the gradients would be exactly zero. Very small
gradients (that underflow to zero) are to be expected here.

If you want “soft” counting, your “soft count” will be a floating-point number
that only approximates your actual count (and you can get useful gradients).

If you want your “soft count” to very closely approximate the true count,
its gradients will become very close to zero. That’s the unavoidable
trade-off.
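A rough illustration of this trade-off (my own sketch; the sharpness values mirror the ones used earlier in the thread): as the multiplier grows, more and more per-element gradients underflow to exactly zero.

```python
import torch

def soft_count_grads(x, sharpness):
    # Gradient of sigmoid(sharpness * x).sum() with respect to x.
    xs = x.clone().requires_grad_(True)
    torch.sigmoid(sharpness * xs).sum().backward()
    return xs.grad

torch.manual_seed(0)
x = torch.randn(10)

g_mild = soft_count_grads(x, 10.0)
g_sharp = soft_count_grads(x, 1000.0)

# With sharpness 10 most elements still get a non-zero gradient;
# with sharpness 1000 most gradients underflow to exactly zero,
# because sigmoid(1000 * x) is already 0.0 or 1.0 in float32.
print((g_mild == 0).sum().item(), (g_sharp == 0).sum().item())
```

This matches the `1.6938e-22` seen above: only elements sitting almost exactly on the transition keep a (tiny) gradient.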

The zero gradients (up to round-off error) are due to the fact that you
used `sum()` to reduce the result of `softmax()` to a scalar on which you
could call `.backward()`. By definition, the `sum()` of `softmax()` is exactly
one (up to round-off error), which is a constant, so the gradient of
`softmax().sum()` is indeed zero.

You could try, for comparison, `b = torch.softmax(a, dim=0).exp().sum()`
and you will see that you get non-zero gradients.
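To make both observations concrete, here is a quick sketch (my own, not from the reply above):

```python
import torch

torch.manual_seed(0)
a = torch.randn(8, requires_grad=True)

# softmax over a full dimension sums to one by construction, so
# sum() of it is a constant and its gradient is zero up to round-off.
s = torch.softmax(a, dim=0).sum()
s.backward()
print(s.item())                    # ~1.0
print(a.grad.abs().max().item())   # ~0, round-off only

# A non-linear reduction such as exp() breaks the constant-sum
# property, so real gradients reappear.
b = a.detach().clone().requires_grad_(True)
torch.softmax(b, dim=0).exp().sum().backward()
print(b.grad.abs().max().item())   # clearly non-zero
```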

Best.

K. Frank


Oh, I see. Thank you very much!