# Is there any soft way of counting positive values with grad preserved?

Hi,

I need a differentiable metric that roughly reflects how many positive values a tensor contains, so an exact count is not necessary.

I tried two hard operations:

1. `relu` + `torch.sign` + `sum`
2. `torch.count_nonzero`

Result: both break the gradient (`torch.sign` has zero gradient almost everywhere, and `torch.count_nonzero` is not differentiable at all).

I'm not sure whether fc layer + relu + soft sign + sum would be a workable option.

Any suggestions?

Thank you!

Hi Ximeng!

`sigmoid (x)`, which is differentiable, moves “softly” from `0` to `1` as `x`
moves from negative to positive.

You may shift where the transition occurs, `sigmoid (x - shift)`, and
sharpen the transition, `sigmoid (sharpness * x)`.

Consider:

```
>>> import torch
>>> torch.__version__
'1.12.0'
>>> _ = torch.manual_seed (2022)
>>> x = torch.randn (5, 8, requires_grad = True)
>>> x
tensor([[-0.9788, -1.5154, -0.8222,  0.1214,  0.0716, -0.0872, -0.0253, -1.6267],
        [ 0.2230, -1.6746, -1.4725,  0.9721, -0.2191, -0.9397, -1.7756, -0.6259],
        [-1.1104,  1.1890,  1.3730,  0.4915,  0.3579, -0.1685, -0.8579, -1.0574],
        [ 0.2105,  1.9045,  1.8237,  1.5122, -0.3140, -0.0810, -1.3631, -0.0701],
        [-1.1876, -1.0787,  0.9551, -0.2958,  1.0663, -0.5134, -0.3846, -1.1481]],
       requires_grad=True)
>>> (x > 0).sum()
tensor(14)
>>> torch.sigmoid (x)
tensor([[0.2731, 0.1801, 0.3053, 0.5303, 0.5179, 0.4782, 0.4937, 0.1643],
        [0.5555, 0.1578, 0.1866, 0.7255, 0.4454, 0.2810, 0.1448, 0.3484],
        [0.2478, 0.7666, 0.7979, 0.6204, 0.5885, 0.4580, 0.2978, 0.2578],
        [0.5524, 0.8704, 0.8610, 0.8194, 0.4221, 0.4798, 0.2037, 0.4825],
        [0.2337, 0.2538, 0.7221, 0.4266, 0.7439, 0.3744, 0.4050, 0.2408]],
       grad_fn=<SigmoidBackward0>)
>>> torch.sigmoid (x).sum()
>>> torch.sigmoid (10 * x).sum()
>>> torch.sigmoid (100 * x).sum()
```
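To add to the example above, here is a small sketch (my own, not from the original reply; the seed and the sharpness value 10 are arbitrary choices) checking that a moderately sharpened sigmoid sum both tracks the hard count and still delivers non-zero gradients:

```python
import torch

torch.manual_seed(0)
x = torch.randn(5, 8, requires_grad=True)

# Hard count: how many entries are actually positive.
hard = (x > 0).sum().item()

# Soft count: a sigmoid with moderate sharpness approximates the
# 0/1 indicator function while remaining differentiable.
soft = torch.sigmoid(10 * x).sum()
soft.backward()

print(hard, round(soft.item(), 3))  # soft count approximates hard count
print(x.grad.abs().max().item())    # non-zero: gradient still flows
```

The sharpness of 10 keeps the transition narrow enough for a decent approximation without pushing the per-element gradients into underflow.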

Best.

K. Frank

Thanks Frank. But I'm not sure whether the sigmoid approach might cause a vanishing-gradient problem.

```
d = torch.sigmoid (1000 * c).sum()
d.backward()
c.grad
tensor([0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        1.6938e-22, 0.0000e+00, 0.0000e+00, 0.0000e+00])
```

So I tried another possible solution, which might slightly mitigate the problem:

```
b = torch.relu(a / (1e-3 + torch.abs(a))).sum()
b.backward()
a.grad
tensor([0.0000, 0.0002, 0.0003, 0.0027, 0.0000, 0.0000, 0.0000, 0.0000, 0.0012,
        0.0029])
```
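If it helps, the idea above can be wrapped in a small helper (the name `soft_count_positive` and the default `eps` are my own choices, not from the thread):

```python
import torch

def soft_count_positive(a, eps=1e-3):
    # a / (eps + |a|) is a smooth "soft sign": roughly -1 for clearly
    # negative entries, roughly +1 for clearly positive ones, with a
    # transition of width ~eps around zero.  relu() keeps only the
    # positive side, and sum() turns the result into a soft count.
    return torch.relu(a / (eps + torch.abs(a))).sum()

torch.manual_seed(0)
a = torch.randn(10, requires_grad=True)
count = soft_count_positive(a)
count.backward()
print(count.item(), (a > 0).sum().item())  # soft vs. hard count
```

As the gradients above show, clearly positive entries still receive only a small gradient (about `eps / a**2`), so `eps` trades off count accuracy against gradient size here too.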

BTW, I noticed that gradients are also tiny for a normal softmax operation:

```
b = torch.softmax(a, dim=0).sum()
b.backward()
a.grad
tensor([7.9883e-09, 1.6591e-08, 4.4367e-09, 1.7502e-08, 1.7472e-08, 9.1981e-09,
        2.4514e-09, 4.3570e-08])
```

Obviously, softmax should always work (e.g., in a self-attention block), even though Float32 only has about 7 significant digits, if I remember correctly. So I guess there must be some misunderstanding about the gradient above.

Hi Ximeng!

`1000` is very large for the multiplier used to “sharpen” the `sigmoid()`.
This causes the `sigmoid()` to become quite close to a (discontinuous)
step function for which the gradients would be exactly zero. Very small
gradients (that underflow to zero) are to be expected here.

If you want “soft” counting, your “soft count” will be a floating-point number
that only approximates your actual count (and you can get useful gradients).

If you want your “soft count” to very closely approximate the true count,
its gradients will become very close to zero. That’s the unavoidable
trade-off.
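A rough illustration of this trade-off (my own sketch; the sharpness values mirror the ones used earlier in the thread): as the multiplier grows, more and more per-element gradients underflow to exactly zero.

```python
import torch

def soft_count_grads(x, sharpness):
    # Gradient of sigmoid(sharpness * x).sum() with respect to x.
    xs = x.clone().requires_grad_(True)
    torch.sigmoid(sharpness * xs).sum().backward()
    return xs.grad

torch.manual_seed(0)
x = torch.randn(10)

g_mild = soft_count_grads(x, 10.0)
g_sharp = soft_count_grads(x, 1000.0)

# With sharpness 10 most elements still get a non-zero gradient;
# with sharpness 1000 most gradients underflow to exactly zero,
# because sigmoid(1000 * x) is already 0.0 or 1.0 in float32.
print((g_mild == 0).sum().item(), (g_sharp == 0).sum().item())
```

This matches the `1.6938e-22` seen above: only elements sitting almost exactly on the transition keep a (tiny) gradient.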

The zero gradients (up to round-off error) are due to the fact that you
used `sum()` to reduce the result of `softmax()` to a scalar on which you
could call `.backward()`. By definition, the `sum()` of `softmax()` is exactly
one (up to round-off error), which is a constant, so the gradient of
`softmax().sum()` is indeed zero.

You could try, for comparison, `b = torch.softmax(a, dim=0).exp().sum()`
and you will see that you get non-zero gradients.
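To make both observations concrete, here is a quick sketch (my own, not from the reply above):

```python
import torch

torch.manual_seed(0)
a = torch.randn(8, requires_grad=True)

# softmax over a full dimension sums to one by construction, so
# sum() of it is a constant and its gradient is zero up to round-off.
s = torch.softmax(a, dim=0).sum()
s.backward()
print(s.item())                    # ~1.0
print(a.grad.abs().max().item())   # ~0, round-off only

# A non-linear reduction such as exp() breaks the constant-sum
# property, so real gradients reappear.
b = a.detach().clone().requires_grad_(True)
torch.softmax(b, dim=0).exp().sum().backward()
print(b.grad.abs().max().item())   # clearly non-zero
```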

Best.

K. Frank


Oh, I see. Thank you very much!