How much does the importance of the input pixel change wrt a certain weight

I have kind of a hard derivative to calculate with autograd: I want the importance of an input pixel (i.e. the derivative of the loss wrt that pixel) differentiated wrt a specific weight/layer.

So basically d(dL/dI)/dw. How do I make sure autograd understands this? When I just try to differentiate it like so:

with self.gradients being the gradient of the loss wrt the input and layer.weight the weights of a layer, I get the warning that self.gradients has an empty computation graph, even though I compute it like this:

    self.model.eval()  # set the model to evaluation mode
    input_data.requires_grad = True  # needed to compute gradients w.r.t. the input
    self.label = label

    if input_data.grad is not None:
        input_data.grad.zero_()

    # Forward pass
    outputs = self.model(input_data)
    if self.last_layer_linear:
        self.activations["output"] = outputs
    self.input = input_data

    # One-hot target for the requested label
    target = torch.zeros(outputs.size(), dtype=torch.float)
    target[0][label] = 1.0
    self.target = target
    self.loss = self.default_loss(outputs, target)

    # Backpropagate to compute gradients of the loss with respect to the input
    self.loss.backward(retain_graph=True)

    # Note: clone().detach() severs these gradients from the computation graph,
    # so setting requires_grad afterwards starts a new, empty graph
    self.gradients = input_data.grad.clone().detach()
    self.gradients.requires_grad = True

If anyone has any ideas I would be very happy!

Hi @SchulzKilian,

If you’re trying to do higher-order derivatives, I’d advise you to look at the torch.func library and use its higher-order derivative API. It will allow you to compute exactly what you want, without having to do multiple calls to autograd and track first-order gradients.

The docs are here: torch.func — PyTorch 2.3 documentation
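For illustration, a minimal sketch of that API on a toy scalar function (the function here is made up, not your model):

    import torch
    from torch.func import hessian

    # Full Hessian of a toy scalar function, no manual double backward needed
    f = lambda x: (x ** 2).sum()
    H = hessian(f)(torch.randn(3))  # 3x3 matrix, here 2 * identity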


Thanks a lot, I'm looking into it right now.

I'm always surprised by how much autograd does for you. The idea that it understands how the weight values feed into the input-pixel importance is cool to me. I hope that works.

To be honest, I don't find enough documentation yet to understand torch.func at my level of PyTorch understanding. Couldn't I just do a double backward pass instead: first get the gradients of the input, and then do another backward pass with those in order to get the gradient of the weights with respect to the gradient of the input data with respect to the loss?

So something like this as a minimal example:

    def backward(self):
        weights = torch.ones_like(self.gradients)
        self.model.zero_grad()

        # Second backward pass, seeded with ones; this only works if
        # self.gradients is still attached to a computation graph
        self.gradients.backward(weights, retain_graph=True)

        for name, param in self.model.named_parameters():
            print(name, param.grad)

Have a look at this thread to get a general idea of what to do with torch.func.


I’ve played around with this approach now but can’t seem to come to a proper result. Transferred to my approach, the function I am differentiating would be the backpropagation, right? Because that gives me the gradients of the loss wrt the input, and then I should take the Hessian of that with respect to the weights. But how do I put the weights as an argument into the backpropagation function? They are just implicitly there, or am I missing something?

You could have a function that returns d_Loss/d_input, whose arguments are dict(model.named_parameters()) and the inputs, then take the Hessian of that w.r.t. your parameters. You can just compose the gradient functions and it should work.
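Roughly, the composition could look like this (a toy linear model and mse loss standing in for yours):

    import torch
    from torch.func import functional_call, grad

    # Toy stand-ins for the real model, input, and target
    model = torch.nn.Linear(4, 3)
    params = dict(model.named_parameters())
    x = torch.randn(1, 4)
    target = torch.zeros(1, 3)
    target[0, 1] = 1.0

    def loss_fn(params, x, target):
        out = functional_call(model, params, (x,))
        return torch.nn.functional.mse_loss(out, target)

    # dL/dI: gradient of the loss w.r.t. the input (argnums=1),
    # reduced to a scalar so it can be differentiated again
    def summed_input_grad(params, x, target):
        return grad(loss_fn, argnums=1)(params, x, target).sum()

    # d(dL/dI)/dw: gradient of that scalar w.r.t. the params (argnums=0)
    mixed = grad(summed_input_grad, argnums=0)(params, x, target)
    for name, g in mixed.items():
        print(name, g.shape)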

If you can share a minimal reproducible script for your model (or at least the input and output size), it should be possible to do.


Ohh, I understand what you mean. My problem though is that I'm not actually working on a model but on an optimizer that uses the Hessian (the optimizer does take the model as an initial argument).

This makes it tough to create a function that takes the parameters and the inputs and gives out the gradients, because it would have to compute that on the fly, correct?

It shouldn’t, because you can create the function that defines the gradients at the initial optimization step, then just pass in the parameters at epoch t, along with the inputs, and it will return the derivatives for the parameters at that epoch.

When creating jacobian = jacrev(fcall, argnums=0) you create a function whose argument is fcall, which I've taken here to be a functional call of the model (via torch.func.functional_call); then you can pass params and inputs to the jacobian function at every epoch for each new set of parameters.
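Schematically (model, inputs, and num_epochs are placeholders here):

    from torch.func import functional_call, jacrev

    # Build the derivative function once...
    fcall = lambda params, x: functional_call(model, params, (x,))
    jacobian = jacrev(fcall, argnums=0)

    # ...then evaluate it every epoch with that epoch's parameters
    for epoch in range(num_epochs):
        params = dict(model.named_parameters())
        jac = jacobian(params, inputs)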


I tried your way and I think it's the right thing for me; now I have memory issues and will try to run it on Google Colab as soon as possible. Just some quick snippets, to make sure I understood you well:

This is how I did the model function in order to have the layer I want to propagate to as a meaningful argument (I don't suppose there's a way for the backpropagation to just understand from the model parameter that the layer is an argument in there?):

This is my first jacrev function that should get the gradients:

And this one is how I then get the Hessian from the first jacobian_func with respect to each parameter:

You shouldn’t call model.forward; you should call the module directly via model(x). And when using torch.func you need to create a functional version of your model via,

    fcall = lambda params, x: torch.func.functional_call(model, params, (x,))

This jacrev call is fine, but you need to specify the argument with respect to which you differentiate, which defaults to argnums=0.

For the Hessian calculation there are a few mistakes: when computing the Hessian you need to pass the params as a dict (no need to include the name), and you need to vectorize the Hessian call, otherwise you'll compute the Hessian w.r.t. all samples in your batch at once and that will OOM.

You want something like this,

    import torch
    from torch.func import jacrev, vmap, functional_call

    params = dict(model.named_parameters())

    fcall = lambda params, x: functional_call(model, params, (x,))

    def calc_loss(params, x, target):
        output = fcall(params, x)
        loss = default_loss(output, target)  # your loss function from earlier
        return loss

    calc_jacobian = jacrev(calc_loss, argnums=0)  # check the args here
    calc_hessian = jacrev(calc_jacobian, argnums=0)

    # vmap over the batch dimension of input and target, sharing params
    hessian_per_sample = vmap(calc_hessian, in_dims=(None, 0, 0))(params, input, target)

Also, wasn't your quantity d(dL/dI)/dw, i.e. the derivative w.r.t. the input, then w.r.t. the params? So why are you computing the Hessian of the params?

If you are computing the Hessian w.r.t. the params, it's usually intractable due to its size, but if you're multiplying it by a vector afterwards it might become tractable via a Hessian-vector product.
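The usual trick (a generic sketch; f maps params to a scalar loss, v has the same structure as params) never materializes the Hessian:

    from torch.func import grad, jvp

    # Hessian-vector product via forward-over-reverse:
    # differentiate grad(f) along direction v instead of building H
    def hvp(f, params, v):
        return jvp(grad(f), (params,), (v,))[1]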


I was hoping that's what I'm doing? First getting the jacobian without argnums specified, so the input, and then again the jacobian with argnums three, so the model parameters? But maybe there's something I'm getting wrong…

The functionalization I get now, thanks; I didn't know I could specify the model as an argument.

Appreciate your help a lot, man.

I might be implementing it wrong in the code.

I use the jacobian thanks to your help and it gives me all the gradients correctly:

Now I choose parts of the input and get a binary matrix in the shape of the input to specify whether I want to take something into account.

Then for each parameter w I want to adjust it according to its d(dL/dI)/dw, summed over the inputs I selected.

I know this is not straightforward; that's why I kind of get stuck.
Do you wanna be my tutor? Serious question: I'm a broke uni student, but if you could help me with my problem over like a half-hour session I'd definitely pay like 20 bucks lol (not actually lol, I'm kinda being serious)

I’d stick with consistent argument signatures, but as long as you’re aware of the change in arguments it shouldn’t matter.

If you compose the function to compute the partial derivative w.r.t. params and input, you should get a dictionary containing the derivative for each part of the network. From there, you can select a particular layer and get what you need.
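A sketch of that selection step (the mask and the layer name "conv1.weight" are made up; default_loss as in your earlier code):

    from torch.func import functional_call, grad

    def masked_input_grad(params, x, target, mask):
        # mask: 0/1 tensor shaped like x, selecting the pixels of interest
        def loss_fn(x_inner):
            out = functional_call(model, params, (x_inner,))
            return default_loss(out, target)
        dL_dI = grad(loss_fn)(x)          # dL/dI
        return (dL_dI * mask).sum()       # scalar importance of masked pixels

    # d(masked dL/dI)/dw, returned as a dict keyed by parameter name
    mixed = grad(masked_input_grad, argnums=0)(params, x, target, mask)
    layer_term = mixed["conv1.weight"]    # select one layer (hypothetical name)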


I've tried it that way, also using vmap, but it still seems to be too big for my laptop.

    hessian_per_sample = vmap(hessian, in_dims=(None, 0, None))(dict(self.model.named_parameters()), self.input, self.target)

After calling it like this I get the error "cannot allocate memory"… Can I maybe go through only parts of the input? Like, how can I pass the entire input to the Hessian, but only differentiate wrt certain pixels, and the same for the weight parameters?
How can I pass all the parameters for the model to work but only differentiate wrt some layers?
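For the layers part, one idea I'm toying with (no clue if this is the right approach) is splitting the parameters into two dicts and only exposing the subset I want derivatives for:

    from torch.func import functional_call, grad

    params = dict(self.model.named_parameters())
    subset = {k: v for k, v in params.items() if k.startswith("fc")}  # layers of interest (illustrative)
    rest = {k: v for k, v in params.items() if k not in subset}

    def loss_fn(subset, x, target):
        # merge the dicts so the model still sees all of its parameters
        out = functional_call(self.model, {**subset, **rest}, (x,))
        return self.default_loss(out, target)

    # derivatives only for the chosen layers
    grads = grad(loss_fn)(subset, x, target)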

How many parameters does your model have?

Also, can you share a snippet of how you're calculating the entire derivative term? Not just one line?

If you only want this partial-derivative Hessian term in order to then multiply it by a vector (of the same shape as params), then you could try a Hessian-vector product.


I have a couple of models I'm testing with because I want the optimizer to work in a generalizable way; the smallest has 12,855,045 parameters, but I would love for it to work with bigger models. I read about the Hessian-vector product and it seems great, though I don't fully understand it. Am I losing a lot of information?

These are the lines where I calculate the hessian:

In the end product I would not use all of the input to differentiate, though.

Yeah, so if you have a model of over 12M parameters, computing a second derivative is most likely intractable (at least on a single GPU).

What do you want to do with the Hessian term once you've calculated it?


I basically want to take a group of input pixels and see, for each weight, how much that weight participated in taking those input pixels into account, and then penalize each weight by a term relative to how much it took those pixels into account.

So I would get the Hessian of the loss wrt the input wrt the parameters, take a parameter, and then sum up how much it took the marked input pixels into account.

So, you're computing a second derivative with shape [batch_size, num_input, num_params], and you want to multiply that by the input to get a resultant tensor of shape [batch_size, num_params]?

But couldn't you just take the weight_decay option from Adam and modify it to act on the gradient instead? So you'd penalize the l2-norm of the gradient (rather than the weight).
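As a sketch of that idea (mask, lam, and optimizer are placeholders), a plain double-backward gradient penalty would look like:

    # Penalize the l2-norm of dL/dI on the masked pixels instead of the weights
    x.requires_grad_(True)
    out = model(x)
    loss = default_loss(out, target)

    # create_graph=True keeps the graph so the penalty itself can be differentiated
    dL_dI, = torch.autograd.grad(loss, x, create_graph=True)
    penalty = ((dL_dI * mask) ** 2).sum()

    optimizer.zero_grad()
    (loss + lam * penalty).backward()  # adds d(penalty)/dw to param.grad
    optimizer.step()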