Hi, I’m just learning PyTorch because I need to migrate my TF 1.x code so that I can use adversarial robustness evaluation libraries. The process has been fairly painless so far, but I’m having trouble understanding the difference between the TensorFlow function tf.gradients, which computes the gradient of an output tensor with respect to an input tensor, and torch.autograd.grad, which seems to work only on scalars. I’ve read similar threads on this forum, but I couldn’t find examples relevant to my problem, so here I am.

Going into specifics, I need to migrate the following function written in TF 1.x to PyTorch:

def cross_lipschitz(f, x, batch_size=100, z_dim=10):
    reg = 0
    grad_matrix_list = [tf.gradients(f[:, k], x)[0] for k in range(z_dim)]
    for l in range(z_dim):
        for m in range(l + 1, z_dim):
            grad_diff_matrix = grad_matrix_list[l] - grad_matrix_list[m]
            norm_for_batch = tf.norm(grad_diff_matrix, ord=2, axis=1)
            reg += tf.reduce_sum(tf.square(norm_for_batch))
    return reg

Can you help me understand how to replace tf.gradients with torch.autograd.grad in this particular situation? I also apologize for any wrong code formatting; it’s not clear to me how to do indentation in the forum text editor.

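torch.autograd.grad isn’t restricted to scalar outputs; you just need to pass its grad_outputs argument. Say you have a function with an N-dimensional input and an M-dimensional output, and a batch size of 100. Passing torch.ones_like(Y) as the third argument (grad_outputs) gives you the gradient of the sum of all output components with respect to X. To get the gradient of a single output component k, as in tf.gradients(f[:, k], x), differentiate Y[:, k] instead and pass torch.ones_like(Y[:, k]) as grad_outputs.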

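import torch

B = 100  # batch size
N = 4    # input dimension
M = 6    # output dimension
X = torch.randn(B, N, requires_grad=True)
fc = torch.nn.Linear(N, M)  # our R^N -> R^M function
Y = fc(X)
# torch.autograd.grad returns a tuple with one gradient per input
(dYdX,) = torch.autograd.grad(Y, X, torch.ones_like(Y))
# dYdX.shape is torch.Size([100, 4])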

From my understanding, PyTorch’s reverse-mode autograd computes vector-Jacobian products: grad_outputs is the vector v, and the result is v^T J. A tensor of ones with the same shape as the output weights every output component equally, so what comes back is the Jacobian summed over the output dimension, not the full Jacobian. It is computed for all samples in the batch at once, because each row of Y depends only on the matching row of X.
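To make the point about grad_outputs concrete, here is a small check, written as a sketch rather than anything authoritative. It assumes a plain torch.nn.Linear layer (so the expected gradient is known in closed form) and a made-up component index k. It shows that passing ones as grad_outputs matches the gradient of Y.sum(), and that differentiating Y[:, k].sum() recovers the gradient of a single output component for every sample.

```python
import torch

torch.manual_seed(0)
B, N, M = 5, 4, 6
X = torch.randn(B, N, requires_grad=True)
fc = torch.nn.Linear(N, M)
Y = fc(X)

# Vector-Jacobian product with a tensor of ones as the vector.
(vjp_ones,) = torch.autograd.grad(
    Y, X, grad_outputs=torch.ones_like(Y), retain_graph=True
)

# The same quantity written as the gradient of a scalar.
(grad_of_sum,) = torch.autograd.grad(Y.sum(), X, retain_graph=True)

# Gradient of one output component k (illustrative index), one row per sample.
k = 2
(grad_k,) = torch.autograd.grad(Y[:, k].sum(), X)

same_as_sum = torch.allclose(vjp_ones, grad_of_sum)
# For a linear layer, d Y[b, k] / d X[b] is row k of the weight matrix.
matches_weight_row = torch.allclose(grad_k, fc.weight[k].detach().expand(B, N))
```

Summing over the batch before differentiating is safe here only because sample b of Y depends on sample b of X alone; a layer that mixes samples, such as batch normalization in training mode, would break that assumption.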

As for the other arguments, you can read about them in more detail in the docs for torch.autograd.grad (PyTorch 1.9.0 documentation). In short: retain_graph keeps the graph alive when you calculate gradients on the same graph multiple times, create_graph builds a graph of the gradient computation itself so you can calculate higher-order derivatives, and allow_unused lets you pass inputs that were not used to compute the outputs (their gradient is returned as None instead of raising an error).
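A short sketch of two of these arguments in action, assuming a toy function y = sum(x^3) chosen only because its derivatives are easy to verify by hand. The tensor named unused is a hypothetical input that takes no part in the computation.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = (x ** 3).sum()

# create_graph=True records the gradient computation itself,
# so the first derivative can be differentiated again.
(dy_dx,) = torch.autograd.grad(y, x, create_graph=True)  # 3 * x**2
(d2y_dx2,) = torch.autograd.grad(dy_dx.sum(), x)         # 6 * x

# allow_unused=True returns None for an input that did not
# contribute to the output, instead of raising an error.
unused = torch.zeros(2, requires_grad=True)
z = (x ** 2).sum()
g_x, g_unused = torch.autograd.grad(z, [x, unused], allow_unused=True)
```

create_graph=True matters whenever the gradient appears inside a loss, as in a gradient-based regularizer, because the loss then has to be backpropagated through the gradient computation.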

I’m not a developer, so this is just my understanding (it could be wrong; if it is, anyone please correct me). One thing that might help you understand grad_outputs is to read up on vector-Jacobian products and how reverse-mode automatic differentiation uses them to compute gradients!
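Putting this together for the original question, one possible PyTorch port of cross_lipschitz might look like the sketch below. It is not a verified drop-in replacement and makes a few assumptions: f was computed from x with x.requires_grad set, each sample of f depends only on the matching sample of x, and the unused batch_size parameter is dropped. Instead of taking a norm and squaring it, the sketch sums squared entries directly, which gives the same value and avoids differentiating a square root at zero. The Linear model in the usage lines is only a stand-in.

```python
import torch

def cross_lipschitz(f, x, z_dim=10):
    """Sketch: f has shape (batch, z_dim) and was computed from x."""
    # One gradient tensor per output component, mirroring
    # tf.gradients(f[:, k], x)[0], which also differentiates the sum of f[:, k].
    # create_graph=True keeps the result differentiable for training.
    grads = [
        torch.autograd.grad(f[:, k].sum(), x, create_graph=True)[0]
        for k in range(z_dim)
    ]
    reg = x.new_zeros(())
    for l in range(z_dim):
        for m in range(l + 1, z_dim):
            diff = grads[l] - grads[m]
            # Squared L2 norm per sample, summed over the batch.
            reg = reg + diff.pow(2).sum()
    return reg

# Usage with a stand-in model (hypothetical sizes).
torch.manual_seed(0)
x = torch.randn(8, 4, requires_grad=True)
model = torch.nn.Linear(4, 3)
reg = cross_lipschitz(model(x), x, z_dim=3)
```

Because the gradients are built with create_graph=True, reg can be added to a training loss and backpropagated into the model parameters.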