Circumventing a non-differentiability problem

Sorry if my question isn’t appropriate to ask here! It’s a somewhat theory-related question, but I’d like people to share their knowledge so we can understand what’s happening inside neural networks.

As far as I understand, PyTorch uses the chain rule to compute the gradients of the loss w.r.t. the network parameters.
Therefore, when we use a non-differentiable function such as a step function (torch.sign()) in a neural network, the gradient that propagates through it is zero, so the loss won’t decrease.
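
A quick way to see this, as far as I understand the autograd behaviour:

import torch

x = torch.tensor([2.0, -3.0], requires_grad=True)
torch.sign(x).sum().backward()
print(x.grad)  # tensor([0., 0.]) -- sign() has zero derivative wherever it is defined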
In the code below, I implemented a very simple network containing a step function, to see whether I could work around this non-differentiability problem.

Here, I apply backpropagation twice. In the first backward pass, I save the gradient arriving just behind the step function (at z), and then I manually feed that saved gradient into y.backward(), skipping over the step function entirely.
Surprisingly, after multiple iterations, the loss becomes 0.
I can’t see why this approach decreases the loss even though I’m not providing the correct gradients. If anyone has an idea, please give me some insight.
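
(For what it’s worth, I believe this manual trick amounts to what is sometimes called a straight-through estimator, which could also be packaged as a custom autograd Function so that a single backward pass suffices. A minimal sketch, where the name SignSTE is just something I made up:

import torch

class SignSTE(torch.autograd.Function):
    # forward: the hard step function
    @staticmethod
    def forward(ctx, x):
        return torch.sign(x)

    # backward: pretend the forward pass was the identity and pass the
    # incoming gradient through unchanged, instead of sign()'s zero derivative
    @staticmethod
    def backward(ctx, grad_output):
        return grad_output

Replacing z = torch.sign(y) with z = SignSTE.apply(y) should make a single loss.backward() deliver to y exactly the gradient that grads['z'] carries in the code below.)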

import torch
import torch.nn as nn

# hooks to stash intermediate gradients during the backward pass
grads = {}
def save_grad(name):
    def hook(grad):
        grads[name] = grad
    return hook


fc1 = nn.Linear(2, 2, bias=False)

input = torch.tensor([1.0, 1.0], requires_grad=True)

wx11 = 1
wx12 = 2
wx21 = 3
wx22 = 4

with torch.no_grad():
    fc1.weight.data = torch.tensor([[wx11, wx21],
                                    [wx12, wx22]], dtype=torch.float)

y = fc1(input)
z = torch.sign(y)        # the non-differentiable step function
out = z.sum()
loss = abs(0 - out)      # ground truth is set to 0 this time

print("========outs========")
print(y)
print(z)
print(out)
print(loss)
print("========train========")
y.register_hook(save_grad('y'))
z.register_hook(save_grad('z'))
out.register_hook(save_grad('out'))
loss.backward(retain_graph=True)
# second backward pass: feed the gradient saved at z directly into y,
# skipping over the step function
y.backward(grads['z'])

gamma = 0.01
for i in range(500):

    for name, param in fc1.named_parameters():
        if param.requires_grad and param.grad is not None:
            param.grad.data.zero_()

    y = fc1(input)
    z = torch.sign(y)
    out = z.sum()
    loss = abs(0 - out)
    y.register_hook(save_grad('y'))
    z.register_hook(save_grad('z'))
    out.register_hook(save_grad('out'))

    # backward: once through the whole graph (the sign() part contributes
    # zero), then once more from y using the gradient saved at z
    loss.backward(retain_graph=True)
    y.backward(grads['z'])
    print("loss: ", loss.item())
    for name, param in fc1.named_parameters():
        if param.requires_grad:
            param.data = param.data - gamma * param.grad

and part of the output is below:

========outs========
tensor([ 4.,  6.])
tensor([ 1.,  1.])
tensor(2.)
tensor(2.)
========train========
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  0.0
loss:  0.0
loss:  0.0
loss:  0.0
loss:  0.0
loss:  0.0
loss:  0.0
loss:  0.0
loss:  0.0
loss:  0.0
loss:  0.0
loss:  0.0
loss:  0.0
loss:  0.0

Hello,

torch.sign() can work with backpropagation.

a = torch.randn(3,3,3, requires_grad=True)
b = torch.sign(a)
b.requires_grad
>> True
b.grad_fn
>> <SignBackward at 0x7fe7727f3ef0>

Sure, but in that case (applying backpropagation just once, in the ordinary way), the loss won’t decrease, since the derivative of torch.sign() is 0 everywhere it is defined, so the gradient that reaches the weights is always 0.
So I got the output below:

========train========
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
loss:  2.0
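
A minimal check of this (reusing the same two-unit setup as above) shows the weight gradient staying at zero after a single ordinary backward pass:

import torch
import torch.nn as nn

fc1 = nn.Linear(2, 2, bias=False)
input = torch.tensor([1.0, 1.0], requires_grad=True)

y = fc1(input)
z = torch.sign(y)
loss = abs(0 - z.sum())
loss.backward()
print(fc1.weight.grad)  # all zeros: sign() blocks the gradient signal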