Autograd function with numerical gradients

I have a non-differentiable loss function: something that takes a few tensors that require gradients, copies them, computes some stuff, and then returns the cost as a tensor.

Is there a way to force the autograd framework to compute the gradients numerically?
Or must I explicitly compute the numerical gradients?

Using autograd I have started to write this:

class torch_loss(torch.autograd.Function):

    @staticmethod
    def forward(ctx, g_T, g_pred, tsr_img, obj):
        ctx.save_for_backward(g_T, g_pred, tsr_img)
        ctx.obj = obj  # non-tensor inputs can be stashed directly on ctx

        return loss_fcn(g_T, g_pred, tsr_img, obj)

    @staticmethod
    def backward(ctx, grad_output):
        g_T, g_pred, tsr_img = ctx.saved_tensors
        obj = ctx.obj

        grad_T = grad_pred = None
        # do something with grad_T, grad_pred, and grad_output

        # one return value per forward input; tsr_img and obj get None
        return grad_T, grad_pred, None, None

Where loss_fcn() is the non-differentiable part, and g_T and g_pred have requires_grad=True. Also, obj is not a tensor, so it cannot be saved with save_for_backward().

Is this doable?

What are the shapes of the returned tensors grad_T and grad_pred if the corresponding inputs have, for example, shapes (3, 4) and (1, 10, 20), respectively?

Actually, I’m trying to do something similar to what you describe here

I believe what you can do is something like this:

import torch

class MyLoss(torch.autograd.Function):
    @staticmethod
    def forward(ctx, y_pred, y):
        ctx.save_for_backward(y, y_pred)
        ###### do whatever you want in here and then return the numerical value
        return (y_pred - y).pow(2).sum()

    @staticmethod
    def backward(ctx, grad_output):
        yy, yy_pred = ctx.saved_tensors
        ##### return some gradient in here (illustrative only, not the exact MSE gradient)
        grad_input = grad_output * yy_pred * 2.0
        return grad_input, None

dtype = torch.float
device = torch.device("cpu")
N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)

w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)

learning_rate = 1e-6
myloss = MyLoss.apply  # custom Functions are called through .apply
for t in range(500):
    y_pred = x.mm(w1).mm(w2)
    loss = myloss(y_pred, y)
    print(t, loss.item())
    loss.backward()
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        w1.grad.zero_()
        w2.grad.zero_()

In the example above, I’m passing both y and y_pred to the forward function, doing some operations to compute the cost, and then saving them. In the backward step, I take the saved values of y and y_pred to compute some gradient, because if the gradient is 0, everything downstream will be zero. In the code above I’m just multiplying y_pred by 2.0 and returning that as the gradient (as an example only, but you can do more operations)… I’m still waiting for someone to answer my question to confirm that my implementation is 100% correct…

Thanks. Yes, I had started doing something like that, but it does not work yet. I have added the code I’ve written above.

If my input tensors to the loss function (the ones that require gradient computation) are two matrices and the output is a single value, what exactly is grad_output? And what should I do with it? The tensors returned by backward have to have the same shapes as the inputs to the loss function, right?

I guess my grad_output is 1, because no further partial derivatives are needed since this is the final loss function.
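A quick probe (a toy example of my own, not from the code above) seems to confirm this: for a scalar loss, backward() seeds grad_output with 1, and it only differs from 1 if something scales the loss afterwards.

import torch

class ProbeLoss(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x.pow(2).sum()  # scalar output, like a loss

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        print("grad_output:", grad_output)
        return grad_output * 2.0 * x  # chain rule; same shape as x

x = torch.randn(3, requires_grad=True)
ProbeLoss.apply(x).backward()          # prints grad_output: tensor(1.)
(3.0 * ProbeLoss.apply(x)).backward()  # prints grad_output: tensor(3.)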

Is there more documentation on this? I guess this is obvious in the simple cases, but not for more complex functions.

(EDITED)

For a custom autograd function, the backward step has to return as many gradients as there are inputs to the forward function…

class MyLoss(torch.autograd.Function):
    @staticmethod
    def forward(ctx, y_pred, y, a, b, c):
        ctx.save_for_backward(y, y_pred)
        ctx.abc = a * b * c  # non-tensor inputs go on ctx, not into save_for_backward
        return (y_pred - y).pow(2).sum() * a * b * c

    @staticmethod
    def backward(ctx, grad_output):
        yy, yy_pred = ctx.saved_tensors
        grad_y_pred = grad_output * 2.0 * (yy_pred - yy) * ctx.abc
        return grad_y_pred, None, None, None, None  ## the Nones correspond to y, a, b, c

You need to return the same number of gradients as there are inputs to forward, since this is what autograd expects, even when some of them are not going to be used (return None for those)…
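For example (a minimal sketch using the class above; the concrete numbers are just placeholders), the gradient returned for y_pred has the same shape as y_pred:

y_pred = torch.randn(4, requires_grad=True)
y = torch.randn(4)

loss = MyLoss.apply(y_pred, y, 1.0, 2.0, 3.0)  # a, b, c are plain Python floats
loss.backward()
print(y_pred.grad.shape)  # torch.Size([4]), matching y_pred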

For documentation, the “Extending PyTorch” notes in the official docs cover custom autograd Functions: https://pytorch.org/docs/stable/notes/extending.html

I hope you find that helpful …


Thanks for the links.
For my case of numerical gradients, I got to this:

import torch
import numpy as np
from torch.autograd import gradcheck

eps = 1e-6

g_A = torch.rand(3, 3)
t_B = torch.rand(3, 3)
o_C = {'foo': 0, 'bar': [0, 1]}

g_A.requires_grad = True

# something simple that is actually differentiable, but it simulates a non-differentiable function.
# it also has a non-tensor input
def loss_nondiff(A, B, C):
    a = A.detach().numpy()  # detach() rather than .data, which is deprecated
    b = B.numpy()[0]
    
    cost = np.expand_dims(np.sum(a) + np.sum(b), 0)
    print(C, cost)
    
    return torch.from_numpy(cost.astype(np.float32))

# autograd wrapper
class loss_test(torch.autograd.Function):
    
    @staticmethod
    def forward(ctx, A, B):
        ctx.save_for_backward(A, B)
        
        return loss_nondiff(A, B, o_C)
    
    @staticmethod
    def backward(ctx, grad_output):
        A, B = ctx.saved_tensors
        
        zeros = torch.zeros(3, 3)
        grad_A = torch.zeros(3, 3)
        
        for i in range(3):
            for j in range(3):
                teps = zeros.clone()
                teps[i, j] += eps
                
                # this grad_output is just to satisfy gradcheck, since it's always 1 for a loss function
                # i have no idea what to do with it, especially if it has a different size
                grad_A[i, j] = grad_output*(loss_nondiff(A+teps, B, o_C)-loss_nondiff(A-teps, B, o_C))/(2*eps)
        
        return grad_A, None
    

criterion = loss_test.apply
gradcheck(criterion, (g_A, t_B))
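One caveat: gradcheck is really meant for double-precision inputs, so with float32 tensors like these you may need to loosen eps and atol (or create g_A and t_B as torch.double) for the check to pass reliably.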

This works because it is a simple case, but when the function is not a loss function and can have outputs of different sizes and/or multiple outputs, I still don’t know what to do with grad_output or with the gradient computations.
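My current understanding, sketched below, is that for a non-scalar output grad_output arrives with the same shape as that output, and backward has to return the vector–Jacobian product, i.e. each numerical directional derivative gets contracted with grad_output. Here nondiff_fn is a hypothetical stand-in for the black-box function:

import torch

eps = 1e-3

# hypothetical stand-in for the non-differentiable function, now with a vector output
def nondiff_fn(A):
    return torch.stack([A.sum(), (A * A).sum()])  # output shape (2,)

class loss_test_vec(torch.autograd.Function):

    @staticmethod
    def forward(ctx, A):
        ctx.save_for_backward(A)
        return nondiff_fn(A)

    @staticmethod
    def backward(ctx, grad_output):
        # grad_output has the shape of the forward output, here (2,);
        # backward must return the vector-Jacobian product, shaped like A:
        #   grad_A[i, j] = sum_k grad_output[k] * d out[k] / d A[i, j]
        A, = ctx.saved_tensors
        grad_A = torch.zeros_like(A)
        for i in range(A.shape[0]):
            for j in range(A.shape[1]):
                dA = torch.zeros_like(A)
                dA[i, j] = eps
                jac_col = (nondiff_fn(A + dA) - nondiff_fn(A - dA)) / (2 * eps)
                grad_A[i, j] = (grad_output * jac_col).sum()
        return grad_A

A = torch.rand(3, 3, dtype=torch.double, requires_grad=True)
print(torch.autograd.gradcheck(loss_test_vec.apply, (A,)))  # True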