Implementing Jacobian differential for loss function

I’m interested in calculating the Jacobian of my loss function. The input to this loss function will be two large vectors, and I would like to compute the Jacobian so that each vector position gets its own partial derivative.

Then I’d like to know how I can backpropagate this vector gradient through the network.

I use the following code to compute the Jacobian:

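import torch

def compute_jacobian(inputs, output):
    """
    :param inputs: Batch X Size (e.g. Depth X Width X Height)
    :param output: Batch X Classes
    :return: jacobian: Batch X Classes X Size
    """
    # inputs must be a leaf tensor that requires grad, so that inputs.grad is populated
    assert inputs.requires_grad

    num_classes = output.size()[1]

    # Allocate on the same device and dtype as the tensors involved
    jacobian = torch.zeros(num_classes, *inputs.size(), dtype=inputs.dtype, device=inputs.device)
    grad_output = torch.zeros_like(output)

    for i in range(num_classes):
        # Reset state from the previous pass: inputs.grad accumulates across backward calls
        if inputs.grad is not None:
            inputs.grad.zero_()
        grad_output.zero_()
        grad_output[:, i] = 1  # select output i for every sample in the batch
        output.backward(grad_output, retain_graph=True)
        jacobian[i] = inputs.grad.detach()

    return torch.transpose(jacobian, dim0=0, dim1=1)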

I’m not the author of this code, and I don’t remember where I found it.

It works well when the output is small, because it performs one backward pass per output dimension (num_classes passes in total).
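For comparison, here is a minimal sketch using torch.autograd.functional.jacobian, which is available in recent PyTorch releases. The Linear model and the sizes are hypothetical stand-ins, not part of the code above. As far as I know, with its default settings this helper still does one reverse-mode pass per output element, so it is more convenient but does not remove the cost.

```python
import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)

# Hypothetical stand-in for a network: 5 input features -> 3 outputs.
model = torch.nn.Linear(5, 3)
x = torch.randn(5)

# Full Jacobian d(output)/d(input) in one call; shape is output shape + input shape.
J = jacobian(model, x)

# Reference: one backward pass per output element, as in the loop above.
x_leaf = x.clone().requires_grad_(True)
y = model(x_leaf)
rows = [torch.autograd.grad(y[i], x_leaf, retain_graph=True)[0]
        for i in range(y.numel())]
J_loop = torch.stack(rows)

print(J.shape)                     # torch.Size([3, 5])
print(torch.allclose(J, J_loop))   # True
```

For a single Linear layer the Jacobian with respect to the input is just the weight matrix, which makes the result easy to check by hand.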

"backpropagate this vector gradient back through the network."

I’m not sure I understand this correctly. Could you please clarify?

This is pretty much what I wanted. Essentially, I wanted to pass a vector to the loss function instead of a scalar, and this is a nice workaround. If I understand it correctly, it runs backpropagation once per index of the vector, using that entry as the loss each time. So the first backpropagation uses index 0,1, the second uses index 0,2, and so on, until we have gone through the whole vector. Is this correct?
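To make the per-index idea concrete, here is a small sketch (the matrix W and the sizes are made up for illustration). Calling backward with a vector argument computes a vector-Jacobian product, so a one-hot vector picks out one row of the Jacobian per pass, and a general vector v backpropagates v^T J in a single pass. In the batched function above, grad_output[:, i] = 1 does the same thing for output i of every sample at once.

```python
import torch

torch.manual_seed(0)

W = torch.randn(3, 4)                    # fixed weights, so dy/dx is exactly W
x = torch.randn(4, requires_grad=True)
y = W @ x                                # vector-valued "loss" with 3 entries

# Pass i: seed backward with a one-hot vector to get row i of the Jacobian.
rows = []
for i in range(y.numel()):
    if x.grad is not None:
        x.grad.zero_()                   # gradients accumulate across backward calls
    one_hot = torch.zeros_like(y)
    one_hot[i] = 1.0
    y.backward(one_hot, retain_graph=True)
    rows.append(x.grad.clone())
J = torch.stack(rows)

# A general vector v gives v^T J in a single backward pass.
v = torch.tensor([1.0, 2.0, 3.0])
x.grad.zero_()
y.backward(v)

print(torch.allclose(J, W))              # True
print(torch.allclose(x.grad, v @ J))     # True
```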

Your suggestion works for a fully connected layer. But I am wondering: is there a way to compute the Jacobian matrix with respect to convolutional layers?

Flattening both the input and the output of a conv layer (i.e. [batch, 64, 28, 28] ==> [batch, 64 * 784]) is not feasible, since the flattened tensor would no longer be a leaf node, so its grad is not populated.
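One possible way around this, sketched under assumptions (the layer sizes are hypothetical and kept tiny so the full Jacobian fits in memory): leave the input as a 4-D leaf tensor, flatten only the output, and flatten each gradient after it has been computed. torch.autograd.grad returns the gradient directly instead of relying on .grad, so it should also work for non-leaf tensors.

```python
import torch

torch.manual_seed(0)

# Hypothetical small conv layer, chosen only for illustration.
conv = torch.nn.Conv2d(in_channels=2, out_channels=3, kernel_size=3, padding=1)
x = torch.randn(1, 2, 4, 4, requires_grad=True)   # leaf tensor, kept in its 4-D shape

y = conv(x)                  # shape (1, 3, 4, 4)
y_flat = y.reshape(-1)       # flattening the output is fine: it was never a leaf

# One backward pass per output element, differentiating w.r.t. the unflattened input.
rows = []
for i in range(y_flat.numel()):
    (g,) = torch.autograd.grad(y_flat[i], x, retain_graph=True)
    rows.append(g.reshape(-1))   # flatten the gradient afterwards, not the input beforehand
J = torch.stack(rows)        # rows: flattened output, columns: flattened input

print(J.shape)               # torch.Size([48, 32])
```

For a real [batch, 64, 28, 28] activation this loop would need 64 * 784 backward passes per sample, so the cost concern about large outputs applies here too.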