I have a non-differentiable loss function. It takes a few tensors that require gradients, copies them, computes some stuff, and then returns the cost as a tensor.
Is there a way to force the autograd framework to compute the gradients numerically?
Or must I explicitly compute the numerical gradients?
Here loss_fcn() is the non-differentiable part, and g_T and g_pred have requires_grad=True. Also, obj is not a tensor, so it cannot be saved with save_for_backward().
Is this doable?
What are the shapes of the output tensors for grad_T and grad_pred if they have, for example, shapes of (3, 4) and (1, 10, 20), respectively?
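To make the shape question concrete, here is a minimal sketch (not code from this thread). It relies on two autograd conventions: backward returns one gradient per forward input, each with the same shape as that input, and a non-tensor such as obj can be kept as a plain attribute on ctx instead of going through save_for_backward(). The class name ShapeDemo, the "weight" key, and the sum-based cost are all made up for illustration.

```python
import torch


class ShapeDemo(torch.autograd.Function):
    """Hypothetical example: two tensor inputs plus one non-tensor input."""

    @staticmethod
    def forward(ctx, g_T, g_pred, obj):
        ctx.save_for_backward(g_T, g_pred)  # tensors go through save_for_backward
        ctx.obj = obj                       # non-tensors are stored directly on ctx
        # Stand-in for the real cost; returns a 0-dim tensor.
        return obj["weight"] * (g_T.sum() + g_pred.sum())

    @staticmethod
    def backward(ctx, grad_output):
        g_T, g_pred = ctx.saved_tensors
        w = ctx.obj["weight"]
        # d(cost)/d(input) is w for every element, so each gradient is
        # a tensor shaped exactly like the corresponding input.
        grad_T = grad_output * w * torch.ones_like(g_T)
        grad_pred = grad_output * w * torch.ones_like(g_pred)
        # One return value per forward input; None for the non-tensor obj.
        return grad_T, grad_pred, None


g_T = torch.rand(3, 4, requires_grad=True)
g_pred = torch.rand(1, 10, 20, requires_grad=True)
loss = ShapeDemo.apply(g_T, g_pred, {"weight": 0.5})
loss.backward()
print(g_T.grad.shape, g_pred.grad.shape)
```

So for inputs of shapes (3, 4) and (1, 10, 20), grad_T and grad_pred come back with shapes (3, 4) and (1, 10, 20).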
Actually, I’m trying to do something similar to what you describe, as in here
I believe what you can do is something like this:
import torch

class MyLoss(torch.autograd.Function):
    @staticmethod
    def forward(ctx, y_pred, y):
        ctx.save_for_backward(y_pred, y)
        # Do whatever you want in here, then return the numerical value.
        return (y_pred - y).pow(2).sum()

    @staticmethod
    def backward(ctx, grad_output):
        y_pred, y = ctx.saved_tensors
        # Gradient of sum((y_pred - y)^2) w.r.t. y_pred, scaled by the incoming gradient.
        grad_y_pred = grad_output * 2.0 * (y_pred - y)
        # One return value per forward input; y needs no gradient here.
        return grad_y_pred, None

dtype = torch.float
device = torch.device("cpu")
N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in, device=device, dtype=dtype)
y = torch.randn(N, D_out, device=device, dtype=dtype)
w1 = torch.randn(D_in, H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, dtype=dtype, requires_grad=True)
learning_rate = 1e-6
myloss = MyLoss.apply
for t in range(500):
    y_pred = x.mm(w1).mm(w2)
    loss = myloss(y_pred, y)
    print(t, loss.item())
    loss.backward()
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        w1.grad.zero_()
        w2.grad.zero_()
In the example above, I pass both y and y_pred to the forward function, save them, and do some operations to compute the cost. In the backward pass, I take the saved values of y and y_pred to compute a gradient, because if the returned gradient is 0, every gradient upstream of it will be zero too. In the code above the gradient is just a multiplication by 2.0 (as an example only; you can do more operations). I’m still waiting for someone to answer my question to ensure that my implementation is 100% correct.
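One way to check a hand-written backward without waiting for a reply is torch.autograd.gradcheck, which compares the backward result against finite differences. The sketch below is only an illustration: SquaredError and WrongSquaredError are hypothetical names, the inputs are double precision because gradcheck is unreliable in float32, and the second class returns a deliberately wrong gradient to show what a failure looks like.

```python
import torch
from torch.autograd import gradcheck


class SquaredError(torch.autograd.Function):
    """Hypothetical loss with the analytically correct backward."""

    @staticmethod
    def forward(ctx, y_pred, y):
        ctx.save_for_backward(y_pred, y)
        return (y_pred - y).pow(2).sum()

    @staticmethod
    def backward(ctx, grad_output):
        y_pred, y = ctx.saved_tensors
        return grad_output * 2.0 * (y_pred - y), None


class WrongSquaredError(torch.autograd.Function):
    """Same forward, but backward deliberately ignores y."""

    @staticmethod
    def forward(ctx, y_pred, y):
        ctx.save_for_backward(y_pred, y)
        return (y_pred - y).pow(2).sum()

    @staticmethod
    def backward(ctx, grad_output):
        y_pred, y = ctx.saved_tensors
        return grad_output * 2.0 * y_pred, None  # wrong on purpose


# gradcheck wants double-precision inputs that require grad.
y_pred = torch.randn(4, 3, dtype=torch.double, requires_grad=True)
y = torch.randn(4, 3, dtype=torch.double)

ok_good = gradcheck(SquaredError.apply, (y_pred, y))
ok_bad = gradcheck(WrongSquaredError.apply, (y_pred, y), raise_exception=False)
print(ok_good, ok_bad)  # True False
```

With raise_exception left at its default, the failing case raises an error describing the mismatch instead of returning False.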
If the input tensors to my loss function (the ones that require gradient computation) are two matrices and the output is a single value, what exactly is grad_output? And what should I do with it? The tensors that have to be returned by backward are tensors with the same shape as the inputs to the loss function, right?
I guess my grad_output is 1 because no further partial derivatives are needed, since this is the loss function.
Is there more documentation on this? I guess this is obvious on the simple cases, but not for more complex functions.
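A small experiment can show what grad_output contains. In this sketch, RecordGrad and its seen list are hypothetical helpers that log the value autograd passes into backward. When backward() is called directly on the scalar output, grad_output is 1. When the output is scaled before backward(), grad_output carries that scale, which is why it should multiply the local gradient.

```python
import torch


class RecordGrad(torch.autograd.Function):
    """Hypothetical function that logs the grad_output it receives."""

    seen = []  # illustration only: collects grad_output values

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x.pow(2).sum()

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        RecordGrad.seen.append(grad_output.item())
        # Chain rule: incoming gradient times the local gradient 2 * x.
        return grad_output * 2.0 * x


x = torch.ones(2, 3, requires_grad=True)

RecordGrad.apply(x).backward()          # output is the final loss
(3.0 * RecordGrad.apply(x)).backward()  # output is scaled afterwards

print(RecordGrad.seen)  # [1.0, 3.0]
print(x.grad.shape)     # same shape as x: torch.Size([2, 3])
```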
For a custom autograd function, the backward step has to return as many gradients as the forward function has inputs:
class MyLoss(torch.autograd.Function):
    @staticmethod
    def forward(ctx, y_pred, y, a, b, c):
        ctx.save_for_backward(y_pred, y)
        # Assuming a, b, c are plain Python numbers, keep their product on ctx.
        ctx.scale = a * b * c
        return (y_pred - y).pow(2).sum() * ctx.scale

    @staticmethod
    def backward(ctx, grad_output):
        y_pred, y = ctx.saved_tensors
        grad_y_pred = grad_output * 2.0 * (y_pred - y) * ctx.scale
        # One value per forward input: y_pred, y, a, b, c
        return grad_y_pred, None, None, None, None
You need to return as many values from backward as forward has inputs, since that is what autograd expects, even if some of them are not going to be used (return None for those).
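As a small illustration of that rule, the sketch below uses ctx.needs_input_grad, a tuple of booleans with one entry per forward input, to skip gradients nobody asked for. ScaledSquaredError and its scale argument are hypothetical names, and scale is assumed to be a plain Python float.

```python
import torch


class ScaledSquaredError(torch.autograd.Function):
    """Hypothetical loss with two tensor inputs and one Python float."""

    @staticmethod
    def forward(ctx, y_pred, y, scale):
        ctx.save_for_backward(y_pred, y)
        ctx.scale = scale  # non-tensor, stored directly on ctx
        return scale * (y_pred - y).pow(2).sum()

    @staticmethod
    def backward(ctx, grad_output):
        y_pred, y = ctx.saved_tensors
        diff = 2.0 * ctx.scale * (y_pred - y)
        grad_y_pred = grad_y = None
        if ctx.needs_input_grad[0]:   # y_pred requires grad
            grad_y_pred = grad_output * diff
        if ctx.needs_input_grad[1]:   # y requires grad
            grad_y = -grad_output * diff
        # Three forward inputs, so three return values; None for scale.
        return grad_y_pred, grad_y, None


y_pred = torch.tensor([1.0, 2.0], requires_grad=True)
y = torch.tensor([0.0, 4.0])  # no gradient requested for the target
ScaledSquaredError.apply(y_pred, y, 0.5).backward()
print(y_pred.grad)  # tensor([ 1., -2.])
print(y.grad)       # None
```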
Thanks for the links.
For my case of numerical gradients I got to this:
import torch
import numpy as np
from torch.autograd import gradcheck

eps = 1e-6
# gradcheck expects double precision; float32 is too coarse for eps = 1e-6
g_A = torch.rand(3, 3, dtype=torch.float64, requires_grad=True)
t_B = torch.rand(3, 3, dtype=torch.float64)
o_C = {'foo': 0, 'bar': [0, 1]}

# Something simple that is actually differentiable, but it simulates a non-differentiable function.
# It also has a non-tensor input.
def loss_nondiff(A, B, C):
    a = A.detach().numpy()
    b = B.numpy()[0]
    cost = np.expand_dims(np.sum(a) + np.sum(b), 0)
    print(C, cost)
    return torch.from_numpy(cost.astype(np.float64))

# autograd wrapper
class loss_test(torch.autograd.Function):
    @staticmethod
    def forward(ctx, A, B):
        ctx.save_for_backward(A, B)
        return loss_nondiff(A, B, o_C)

    @staticmethod
    def backward(ctx, grad_output):
        A, B = ctx.saved_tensors
        grad_A = torch.zeros_like(A)
        for i in range(3):
            for j in range(3):
                teps = torch.zeros_like(A)
                teps[i, j] += eps
                # This grad_output is just to shut up gradcheck, since it's always 1 for the loss function.
                # I have no idea what to do with it, especially if it has a different size.
                grad_A[i, j] = grad_output * (loss_nondiff(A + teps, B, o_C) - loss_nondiff(A - teps, B, o_C)) / (2 * eps)
        return grad_A, None

criterion = loss_test.apply
gradcheck(criterion, (g_A, t_B))
This works because it is a simple case, but for the case where the function is not a loss function, and you can have outputs of different sizes and/or multiple outputs, I still don’t know what to do with grad_output or with the gradient computations.
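For a non-scalar output, one reading of the autograd convention is that grad_output has the same shape as the output, and backward should return the vector-Jacobian product: for each input element, the sum over all output elements of grad_output times the partial derivative of that output element. The sketch below does this with central differences. It is a rough illustration under stated assumptions: black_box is a made-up stand-in for the non-differentiable code, NumericalVJP is a hypothetical name, and the loop costs two forward evaluations per input element, so it is only practical for small inputs.

```python
import numpy as np
import torch
from torch.autograd import gradcheck


def black_box(a):
    """Hypothetical stand-in for code autograd cannot trace; (3, 4) -> (2, 3)."""
    a_np = a.detach().numpy()
    out = np.stack([a_np.sum(axis=1), (a_np ** 2).sum(axis=1)])
    return torch.from_numpy(out)


class NumericalVJP(torch.autograd.Function):
    @staticmethod
    def forward(ctx, a):
        ctx.save_for_backward(a)
        return black_box(a)

    @staticmethod
    def backward(ctx, grad_output):
        (a,) = ctx.saved_tensors
        eps = 1e-6
        grad_a = torch.zeros_like(a)
        flat_grad = grad_a.view(-1)  # view into grad_a, so writes land in grad_a
        for k in range(a.numel()):
            delta = torch.zeros_like(a)
            delta.view(-1)[k] = eps
            # Central difference: d(output)/d(a_k), same shape as the output.
            d_out = (black_box(a + delta) - black_box(a - delta)) / (2 * eps)
            # Contract with grad_output over every output element.
            flat_grad[k] = (grad_output * d_out).sum()
        return grad_a


a = torch.rand(3, 4, dtype=torch.float64, requires_grad=True)
out = NumericalVJP.apply(a)
print(out.shape)                           # torch.Size([2, 3])
print(gradcheck(NumericalVJP.apply, (a,)))
```

With multiple outputs, backward receives one grad_output per output, and the same contraction can be summed over all of them.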