Problem when computing batch Jacobian

I have a problem when computing a batch Jacobian. I am not sure whether it is a bug or whether I am using the autograd engine incorrectly.

I used the following snippet to compute the Jacobian of the output. I compared the Jacobian computed by PyTorch against a Theano/Lasagne network initialized with identical parameters. For the first output (i = 0) the results are identical. However, for the subsequent backward calls (i > 0) the results differ by a constant factor (in some cases 2 or 3, but not always deterministically).

What could cause the gradient to be accumulated multiple times in the leaves, even after the input gradient was reset?

import torch
from torch.autograd.gradcheck import zero_gradients

def compute_jacobian(inputs, output):
    assert inputs.requires_grad
    num_classes = output.size()[1]

    jacobian = torch.zeros(num_classes, *inputs.size())
    grad_output = torch.zeros(*output.size())
    if inputs.is_cuda:
        grad_output = grad_output.cuda()
        jacobian = jacobian.cuda()

    for i in range(num_classes):
        # reset the accumulated leaf gradient before each backward pass
        zero_gradients(inputs)
        # one-hot grad_output selects the i-th output for every sample in the batch
        grad_output.zero_()
        grad_output[:, i] = 1
        # keep the graph buffers so backward can be called again for the next class
        output.backward(grad_output, retain_variables=True)
        jacobian[i] = inputs.grad.data

    return jacobian
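
For context, this is roughly how I call the function (the model and shapes here are just placeholders, not the actual network I am comparing against Lasagne):

import torch
from torch import nn
from torch.autograd import Variable

# placeholder model just to illustrate the call
model = nn.Sequential(nn.Linear(784, 10))

inputs = Variable(torch.rand(100, 784), requires_grad=True)
output = model(inputs)                       # shape: (100, 10)
jacobian = compute_jacobian(inputs, output)  # shape: (10, 100, 784)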

Have you solved this problem? I tested it with some simple snippets and it works fine. Maybe you can provide more information if the problem still exists. Calling backward many times can cause problems if you don't handle it carefully.
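
For example, gradients accumulate in the leaves across repeated backward calls unless you zero them in between (a minimal sketch, not your network):

import torch
from torch.autograd import Variable

x = Variable(torch.ones(3), requires_grad=True)
y = (x * 2).sum()

y.backward(retain_variables=True)
print(x.grad.data)   # all 2s

y.backward(retain_variables=True)
print(x.grad.data)   # all 4s: the second call added to the existing gradient

x.grad.data.zero_()  # resetting the leaf gradient restores the expected result
y.backward(retain_variables=True)
print(x.grad.data)   # all 2s again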

@chenyuntc I found the reason: importing theano somehow conflicts with the PyTorch buffers during the second backward call. Commenting out the theano import gives identical results. Interestingly, if the PyTorch model and all variables are on CUDA, the script below passes. I used the latest Theano dev version and the official PyTorch wheel.

Script to reproduce:

import numpy as np

import theano # comment / uncomment
import torch
from torch import nn
from torch.autograd import Variable
from torch.autograd.gradcheck import zero_gradients

model = nn.Sequential(
    nn.Linear(784, 1000),
    nn.ReLU(),
    nn.Linear(1000, 1000),
    nn.ReLU(),
    nn.Linear(1000, 10))

x = Variable(torch.rand(100, 784), requires_grad=True)
y = model(x)

grad_var = torch.zeros(*y.size())
grad_var[:, 0] = 1
# first backward pass w.r.t. output 0
y.backward(grad_var, retain_variables=True)
x_grad1 = x.grad.data.numpy().copy()

# reset the leaf gradient and repeat exactly the same backward pass
zero_gradients(x)
grad_var.zero_()
grad_var[:, 0] = 1
y.backward(grad_var, retain_variables=True)
x_grad2 = x.grad.data.numpy().copy()

# with the theano import active this fails on CPU in my setup
assert np.allclose(x_grad1, x_grad2)
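
As a possible workaround (not a fix for the import conflict itself), one could avoid relying on .grad accumulation entirely by using torch.autograd.grad, which returns the gradient directly instead of writing it into the leaf. This is only a sketch, and it assumes a PyTorch version where torch.autograd.grad with the retain_graph keyword is available:

import numpy as np
import torch
from torch import nn
from torch.autograd import Variable

model = nn.Sequential(
    nn.Linear(784, 1000), nn.ReLU(),
    nn.Linear(1000, 1000), nn.ReLU(),
    nn.Linear(1000, 10))

x = Variable(torch.rand(100, 784), requires_grad=True)
y = model(x)

grad_var = torch.zeros(*y.size())
grad_var[:, 0] = 1

# torch.autograd.grad returns the gradient instead of accumulating it into x.grad,
# so there is no leaf buffer that needs to be zeroed between calls
g1, = torch.autograd.grad(y, x, grad_outputs=grad_var, retain_graph=True)
g2, = torch.autograd.grad(y, x, grad_outputs=grad_var, retain_graph=True)
assert np.allclose(g1.data.numpy(), g2.data.numpy())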