# How to restrict the neural network gradient value w.r.t. some input

Hi, I’m interested in learning a function `NN(x1, x2)` whose derivative w.r.t. `x2` is positive.

In my code, `x1.shape = (N, D)` and `x2.shape = (N, 1)`, where `N` is the number of data points. I build `X = torch.cat([x1, x2], dim=-1)` and then call `NN(X)`.

I only want to constrain the derivative w.r.t. `x2` to be positive (which is well defined, since `x2` is one-dimensional) while leaving `x1` free, and I don’t want to train two separate neural networks.

What is the proper and safe way of doing this?

Hi,

How do you enforce that the derivative for `X2` is positive?

Hi, that’s my question. I’m wondering whether it’s possible.

Well, the derivative is what it is. You cannot really change its value.
It’s like asking: if you have a model, how do you enforce that the output is always positive?

I’m thinking about adding a regularization term, max(−∂NN(X1, X2)/∂X2, 0), evaluated for each X1, but I’m not sure how to do this properly.

You can do this as follows:

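``````
import torch

# X1, X2, label, NN, crit and optimizer are placeholders for your own setup
loss_weighting = 1.0  # example value: some constant to weight the different losses

X2.requires_grad_()  # needed so autograd can differentiate w.r.t. X2
out = NN(torch.cat([X1, X2], dim=-1))
task_loss = crit(out, label)

# gradient of the scalar loss w.r.t. X2; create_graph=True keeps it differentiable
gX2, = torch.autograd.grad(task_loss, X2, create_graph=True)
penalty_loss = torch.clamp(-gX2, min=0).sum()  # elementwise max(-g, 0); other things than sum could be used here

loss = task_loss + loss_weighting * penalty_loss

optimizer.zero_grad()
loss.backward()
optimizer.step()
``````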

Thanks for the suggestion. Just a few questions:

(1) `loss = task_loss + loss_weighting * penalty_loss` and then `loss.backward()`: why do I call backward on `penalty_loss`? Since `penalty_loss` is already built from the derivative w.r.t. `X2`, isn’t `loss.backward()` going to compute a second derivative w.r.t. `X2`?

`penalty_loss` is the penalty you want to apply.
I assumed that you wanted to apply it with gradient descent, the same way you learn your original loss.
So you want the derivative of this new penalty w.r.t. your weights (and yes, this is a second-derivative-like thing).
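The "second-derivative-like" step can be tried on a toy problem. The following is only a minimal sketch, assuming a hypothetical scalar-output MLP and random data (the sizes, the `Tanh` layer and the learning rate are illustrative choices, not part of the original code). It differentiates the network output with respect to `X2`, which works per sample because each output row depends only on its own input row.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy setup: N points, D free features in X1, one feature in X2.
N, D = 8, 3
X1 = torch.randn(N, D)
X2 = torch.randn(N, 1, requires_grad=True)  # must require grad to differentiate w.r.t. it
label = torch.randn(N, 1)

NN = nn.Sequential(nn.Linear(D + 1, 16), nn.Tanh(), nn.Linear(16, 1))
optimizer = torch.optim.SGD(NN.parameters(), lr=1e-2)
loss_weighting = 1.0  # example value

optimizer.zero_grad()
out = NN(torch.cat([X1, X2], dim=-1))  # shape (N, 1)
task_loss = nn.functional.mse_loss(out, label)

# d out[n] / d X2[n] for every sample; summing is safe because rows are independent.
# create_graph=True records this computation so the penalty can be backpropagated.
gX2, = torch.autograd.grad(out.sum(), X2, create_graph=True)
penalty_loss = torch.relu(-gX2).sum()  # elementwise max(-g, 0)

loss = task_loss + loss_weighting * penalty_loss
loss.backward()  # differentiates the penalty w.r.t. the weights (double backward)
optimizer.step()
```

Without `create_graph=True`, `gX2` would be a constant as far as autograd is concerned, and the penalty would contribute nothing to the weight gradients.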

Oh, I see. Is `loss_weighting` a Lagrange multiplier?

minimize task_loss subject to ∂NN(X1, X2)/∂X2 > 0

is equivalent to: minimize task_loss subject to −∂NN(X1, X2)/∂X2 < 0

so it becomes task_loss + loss_weighting * (−∂NN(X1, X2)/∂X2)?
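Written out, the two formulations compare as follows. This is only a sketch in notation assumed here ($\lambda$ stands for `loss_weighting`, $f_\theta$ for `NN` with weights $\theta$, and $n$ indexes data points). With a fixed, hand-chosen $\lambda$ the hinge term acts as a soft penalty, whereas a Lagrange multiplier in the strict sense would itself be optimized alongside the weights.

```latex
\begin{aligned}
\text{constrained:}\quad & \min_\theta \; L_{\text{task}}(\theta)
  \quad \text{s.t.} \quad -\frac{\partial f_\theta\big(x_1^{(n)}, x_2^{(n)}\big)}{\partial x_2} \le 0
  \quad \forall n \\
\text{penalized:}\quad & \min_\theta \; L_{\text{task}}(\theta)
  + \lambda \sum_{n} \max\!\left(-\frac{\partial f_\theta\big(x_1^{(n)}, x_2^{(n)}\big)}{\partial x_2},\; 0\right)
\end{aligned}
```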

Just one more question. In the code it looks like

``````
for iters in range(total_runs):
    # some code here
    ...
``````

Since we create a graph in each iteration, is this going to take massive memory as training goes on? Does the algorithm free the graph after `optimizer.step()`, just as in regular optimization?

It will use more memory as it does more computations, but the graph will be deleted as soon as it’s not needed (at the beginning of the next iteration).
So you don’t need to worry about leaking memory.

Thanks; so I can think of it as taking O(1) memory in terms of the number of training iterations?

Yes O(1) in terms of the number of iterations.

Thanks for the help!

Hi, just one more follow-up. How do I use `torch.autograd.grad(y, X2)` when my `y = NN(X1, X2)` is an `N x D` matrix, where N is the number of points and D is the output dimension? I tried `torch.autograd.grad(y, X2, create_graph=True, grad_outputs=torch.ones_like(y))`; it returns a tensor with shape `(N,)` (the derivative of the sum over the output dimensions), but I was expecting `(N, D)`: the derivative of each output dimension separately, so that I can keep each of them positive.

My toy example is

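``````
import torch
import torch.nn as nn

x = torch.randn(5, 1)                     # my X1 above
z = torch.ones(5, 1, requires_grad=True)  # my X2 above; requires grad so we can differentiate w.r.t. it
f = nn.Linear(2, 3)
x = torch.cat([x, z], dim=-1)             # concatenated input, shape (5, 2)
y = f(x)                                  # shape (5, 3)
``````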

To be more specific, take a single data point.

I have a vector-valued function

f(x, z) = [y1, y2, y3] = [x + 3z, x + 4z, x + 5z]

Then df/dz = [3, 4, 5], but autograd gives me the sum, 12.

So I guess autograd computes

`torch.ones(y.shape) @ [ [dy1/dx, dy1/dz], [dy2/dx, dy2/dz], [dy3/dx, dy3/dz] ] = [ [1,1,1] ] @ [ [1,3],[1,4],[1,5] ]`

Is there any way to avoid this final matrix multiplication?
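The guess about the matrix product can be checked numerically. A small sketch, assuming the function above is encoded as a bias-free `nn.Linear` whose weight rows are `[1, 3]`, `[1, 4]`, `[1, 5]` (the batch size of 4 and the name `xz` are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

# f(x, z) = [x + 3z, x + 4z, x + 5z] written as a linear layer without bias.
f = nn.Linear(2, 3, bias=False)
with torch.no_grad():
    f.weight.copy_(torch.tensor([[1.0, 3.0], [1.0, 4.0], [1.0, 5.0]]))

xz = torch.randn(4, 2, requires_grad=True)  # columns: x, z
y = f(xz)                                   # shape (4, 3)

# Vector-Jacobian product with a vector of ones: sums the Jacobian rows per sample.
g, = torch.autograd.grad(y, xz, grad_outputs=torch.ones_like(y))
print(g)  # every row is [3., 12.]: [1, 1, 1] @ [[1, 3], [1, 4], [1, 5]]
```

The `12` in the last column is `3 + 4 + 5`, so the three per-output derivatives are indeed collapsed into one number.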

The naive way would be

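``````
J = []
for i in range(D):
    out = torch.zeros_like(y)  # one-hot selector for output dimension i
    out[:, i] = 1
    # x is the concatenated input from the toy example above
    j, = torch.autograd.grad(y, x, grad_outputs=out, create_graph=True)
    J.append(j)
J = torch.stack(J)   # shape (D, N, number of inputs)
dy_dz = J[..., -1]   # last input column: d y_i / d z, shape (D, N)
``````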

This is just not going to work, as I can’t loop over a high-dimensional output every iteration. It also doesn’t seem reasonable: I only want dy_dz (the last column of the Jacobian), yet I have to loop over all dimensions to build the full Jacobian.

Hi,

Unfortunately autograd does not compute Jacobians, only vector-Jacobian products.
Note that in my code I did not give the output of NN to `.grad()` but the `loss` itself. There are two main reasons for this:

• I would argue that the gradient you want to constrain is not the Jacobian of the function but the gradient of the loss w.r.t. the input.
• The loss is scalar, so a single backward pass gives the full Jacobian (the gradient) and you can apply the penalty directly, without a for-loop of `.grad()` calls.

Hi,

I’m a bit confused, because my main goal is to make a neural network NN(X1, X2) (where X2 is a scalar, X1 is a vector, and the output of NN(X1, X2) can have any dimension) monotonic non-decreasing w.r.t. X2; that is to say, NN(X1, X2 + epsilon) >= NN(X1, X2) for any epsilon > 0, in every output dimension. That’s why I’m restricting its derivative to be non-negative. However, it seems this is not going to work in PyTorch?

Oh,

I had not understood your goal.
So yes, you can do that, but you will need the full Jacobian, so you will need a for-loop as you mentioned. Also, a small code sample to help you get the full Jacobian can be found here.

Note that this gradient penalty will not guarantee that the resulting network will be monotonic non-decreasing. It will just penalize it if it is not.
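Because the penalty only discourages violations, it can be worth measuring them after training. Below is a rough sketch of such a check; `violation_rate` is a hypothetical helper written for this illustration, and the `nn.Linear` model is only a stand-in for a trained network. It tests a single step size `eps` at the given points, so a rate of zero is evidence of monotonicity, not a proof.

```python
import torch
import torch.nn as nn

def violation_rate(model, X1, X2, eps=1e-2):
    """Fraction of (sample, output) pairs where increasing X2 by eps lowers the output."""
    with torch.no_grad():
        base = model(torch.cat([X1, X2], dim=-1))
        shifted = model(torch.cat([X1, X2 + eps], dim=-1))
    return (shifted < base).float().mean().item()

torch.manual_seed(0)
X1, X2 = torch.randn(100, 5), torch.randn(100, 1)
model = nn.Linear(6, 7)  # stand-in for a trained network with 7 outputs
rate = violation_rate(model, X1, X2)
print(f"violation rate: {rate:.2f}")
```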

I know it gives no guarantee, since this is just an optimization subject to a constraint, but I’m fine with that.
Again, the loop isn’t very promising because the dimension is so high.

To clarify with an example: NN(input = [x1, x2, x3, x4, x5, z]) = [y1, y2, y3, y4, y5, y6, y7]

I want NN(input = [x1, x2, x3, x4, x5, z + epsilon])[i] >= NN(input = [x1, x2, x3, x4, x5, z])[i] for all i and all epsilon > 0.

I’m afraid this is a limitation of backward-mode AD (which all DL frameworks use): it can only compute vector-Jacobian products.

Yeah, I see. Thanks for the help; I’m going to think about ways of enforcing monotonicity other than the gradient method.
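For the specific case in this thread, where `X2` is a single scalar per sample, the derivative of every output with respect to `X2` is one Jacobian-vector product rather than a full Jacobian. A possible sketch follows, assuming a PyTorch version that provides `torch.autograd.functional.jvp` and reusing the toy `nn.Linear` setup from earlier. As far as I know, this helper is itself built from two backward passes (the "double backward trick"), so it stays within reverse-mode AD; the wrapper `f_of_z` is a name made up for this example.

```python
import torch
import torch.nn as nn
from torch.autograd.functional import jvp

torch.manual_seed(0)
N = 5
x = torch.randn(N, 1)  # X1, held fixed
z = torch.ones(N, 1)   # X2
f = nn.Linear(2, 3)    # toy model with 3 outputs

def f_of_z(z_):
    # Only z is treated as the differentiated input; x is closed over.
    return f(torch.cat([x, z_], dim=-1))

# A tangent of ones perturbs every sample's z at once. Rows are independent,
# so entry [n, i] of the result is d y[n, i] / d z[n].
y, dy_dz = jvp(f_of_z, z, torch.ones_like(z), create_graph=True)

penalty = torch.relu(-dy_dz).sum()  # penalize negative slopes in every output
penalty.backward()                  # gradients reach f's weights
```

The cost is a fixed number of passes regardless of the output dimension, in contrast to the per-output loop.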
