Get gradient and Jacobian wrt the parameters

Here is a quick sketch of my code:

input = x
f = model()   # our model is a fully connected architecture
output = f(input)

How can I get the gradient of the output with respect to the model parameters?
Explanation: it is a vector of length I whose i-th element is ∂f(x)/∂ω_i.

How can I get the Jacobian of the output with respect to the model parameters?
Explanation: it is an I × J matrix whose elements are ∂f_i(x)/∂ω_j, where i indexes the elements of the output and j indexes the parameters.

Hint (maybe):
I found some useful functions to compute the Jacobian here:
How to compute Jacobian matrix in PyTorch? - #14 by pascal_notsawo

I have tried to differentiate the output w.r.t. the parameters flattened with parameters_to_vector():

myJacobian = jacobian(output, nn.utils.parameters_to_vector(model.parameters()))

but the result is None, maybe because parameters_to_vector() breaks the computational graph?

Any idea how to compute the Jacobian wrt the parameters?
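As a side note (my own small check, not part of the original question): parameters_to_vector() builds a new flattened tensor out of the parameters, but the forward pass never uses that tensor, so autograd finds no path from it to the output and the gradient comes back as None.

import torch
from torch import nn

model = nn.Linear(3, 2)
out = model(torch.randn(3))

flat = nn.utils.parameters_to_vector(model.parameters())  # new tensor, never used by the forward pass
grads = torch.autograd.grad(out.sum(), flat, allow_unused=True)
print(grads)  # (None,): there is no autograd path from `flat` to `out`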


Maybe @albanD would have an idea :slight_smile: ? You seem really knowledgeable on this topic.

Hi,

It is tricky to use these functions with nn.Modules because they compute the Jacobian of the output wrt the inputs you pass in, while in your case you want the Jacobian wrt the parameters.

You have two options here:

  • You can use a similar approach to this code that extracts the weights and puts them back into the Module, so that you can write a function that actually takes the Parameters as input. Then you can call functions like torch.autograd.functional.jacobian() with it.
  • Write by hand a function that reconstructs the Jacobian for an nn.Module, similar to the one you linked, but instead of giving x to autograd.grad, give it model.parameters() to get the gradients wrt the params and not the input (see the sketch after this list).
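A minimal sketch of the second option (my own illustration, not albanD's code), assuming a model with a single 1-D output y: call autograd.grad once per output entry with the parameter list as inputs, then flatten and stack the per-entry gradients.

import torch
from torch import nn

model = nn.Sequential(nn.Linear(3, 4), nn.ReLU(), nn.Linear(4, 2))
x = torch.randn(3)
y = model(x)                                   # shape (2,)

params = list(model.parameters())
rows = []
for i in range(y.shape[0]):
    # gradients of the i-th output entry w.r.t. every parameter tensor
    grads = torch.autograd.grad(y[i], params, retain_graph=True)
    rows.append(torch.cat([g.reshape(-1) for g in grads]))

J = torch.stack(rows)                          # shape (output_dim, total_number_of_parameters)
print(J.shape)                                 # torch.Size([2, 26])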

I hope this will help you. The model is a multilayer perceptron with ReLU as the activation function.

import torch
from torch import nn

def gradient(y, x, grad_outputs=None):
    """Compute dy/dx @ grad_outputs"""
    if grad_outputs is None:
        grad_outputs = torch.ones_like(y)
    grad = torch.autograd.grad(y, [x], grad_outputs = grad_outputs, create_graph=True)[0]
    return grad

def jacobian(y, x):
    """Compute dy/dx = dy/dx @ grad_outputs; 
    for grad_outputs in[1, 0, ..., 0], [0, 1, 0, ..., 0], ...., [0, ..., 0, 1]"""
    jac = torch.zeros(y.shape[0], x.shape[0]) 
    for i in range(y.shape[0]):
        grad_outputs = torch.zeros_like(y)
        grad_outputs[i] = 1
        jac[i] = gradient(y, x, grad_outputs = grad_outputs)
    return jac

class Linear(nn.Module):
    """costomized linear layer"""
    def __init__(self, in_features, out_features, bias = True, activation_function = None):
        super(Linear, self).__init__()
        self.linear = nn.Linear(in_features, out_features, bias = bias)
        self.activation_function = activation_function if activation_function else lambda x : x
    
    def forward(self, x):
        return self.activation_function(self.linear(x))

class MLP(nn.Module):
    """Multi-layer perceptron"""
    def __init__(self, in_features, hidden_features, hidden_layers, out_features, activation_function = None):
        super(MLP, self).__init__()
        net = []
        net.append(Linear(in_features, hidden_features, True, activation_function))  
        net += [Linear(hidden_features, hidden_features, True, activation_function) for _ in range(hidden_layers)]  
        net.append(Linear(hidden_features, out_features, True, None)) 
        self.net = nn.Sequential(*net)

    def forward(self, x):
        return self.net(x)

input_dim = 3
output_dim = 2
hidden_dim = 5
n_hidden = 2

model = MLP(in_features = input_dim, 
            hidden_features = hidden_dim, 
            hidden_layers = n_hidden, 
            out_features = output_dim, 
            activation_function = torch.relu)

x = torch.tensor([[1., 2., 0.], [1., 0., -1.]]) # batch of 2 tensors
y = model(x)

for param in model.parameters():
    print(gradient(y, param))  # gradient of y.sum() w.r.t. this parameter (default grad_outputs is all ones)
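As a small usage note (my own addition, not part of the original post): gradient(y, param) with the default grad_outputs of ones returns the gradient of y.sum() with respect to that parameter. To get the full Jacobian of a single sample with respect to one weight matrix, you can loop over the output entries with one-hot grad_outputs, in the same spirit as the jacobian helper above:

y0 = model(x[0])                        # single sample, shape (output_dim,)
w = model.net[0].linear.weight          # first layer's weight, shape (hidden_dim, input_dim)

rows = []
for i in range(y0.shape[0]):
    onehot = torch.zeros_like(y0)
    onehot[i] = 1.0
    rows.append(gradient(y0, w, grad_outputs=onehot).reshape(-1))

jac_w = torch.stack(rows)               # shape (output_dim, hidden_dim * input_dim)
print(jac_w.shape)                      # torch.Size([2, 15])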
    print()

Hi @albanD mate,

I find your suggestions particularly helpful, cheers!

One quick question though: after ‘load_weights’, model.parameters() is still empty. How can I restore the model parameters as well?

Thanks in advance!

The point of this code is to have inputs that (can) have history, and so these are not Parameters.
If you just want to save/set the parameters, you can use load_state_dict().

If you just want to restore the module to its original behavior after using this, you can modify load_weights to actually set nn.Parameters and not plain Tensors. After that, the .parameters() call will work again (a minimal sketch follows).
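A minimal sketch of that modification (my own illustration, assuming `names` and `params` are what the extract_weights helper returned): walk down to the submodule that owns each attribute and assign an nn.Parameter, which re-registers it so that .parameters() is populated again.

from torch import nn

def restore_parameters(mod: nn.Module, names, params) -> None:
    """Re-register plain tensors as nn.Parameter so that mod.parameters() works again."""
    for name, p in zip(names, params):
        *path, attr = name.split(".")
        sub = mod
        for part in path:
            sub = getattr(sub, part)     # walk down to the owning submodule
        # assigning an nn.Parameter via setattr registers it with the Module again
        setattr(sub, attr, nn.Parameter(p.detach()))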

Hi there,

I wrote this self-contained version of the idea proposed by @albanD.
It computes the Jacobian accordingly.

One question does remain, though:
Why do we have to “deparameterize” the model weights?
I struggled a bit with the implementation until I explicitly replaced the parameters with plain tensors as pointed out by @albanD.

Also, this implementation keeps the original model unchanged, so the Jacobian is in fact computed on a substitute/replicated model in order not to mess with the parameters that are possibly tracked by an optimizer.

Finally, does the function that is used to ‘pass the parameter as an input to the net’ look OK to all of you?
It seems to work, but maybe there’s a more elegant solution.

import copy
from typing import List, Tuple

import numpy as np
import torch
from torch import nn
from torch.nn import Sequential, Module, Parameter
from torch.nn import Linear, Tanh, ReLU

Tensor = torch.Tensor
FloatTensor = torch.FloatTensor

torch.set_printoptions(precision=4, sci_mode=False)
np.set_printoptions(precision=4, suppress=True)


def _del_nested_attr(obj: nn.Module, names: List[str]) -> None:
	"""
	Deletes the attribute specified by the given list of names.
	For example, to delete the attribute obj.conv.weight,
	use _del_nested_attr(obj, ['conv', 'weight'])
	"""
	if len(names) == 1:
		delattr(obj, names[0])
	else:
		_del_nested_attr(getattr(obj, names[0]), names[1:])

def extract_weights(mod: nn.Module) -> Tuple[Tuple[Tensor, ...], List[str]]:
	"""
	This function removes all the Parameters from the model and
	returns them as a tuple, as well as their original attribute names.
	The weights must be re-loaded with `load_weights` before the model
	can be used again.
	Note that this function modifies the model in place and after this
	call, mod.parameters() will be empty.
	"""
	orig_params = tuple(mod.parameters())
	# Remove all the parameters in the model
	names = []
	for name, p in list(mod.named_parameters()):
		_del_nested_attr(mod, name.split("."))
		names.append(name)

	'''
		Make params regular Tensors instead of nn.Parameter
	'''
	params = tuple(p.detach().requires_grad_() for p in orig_params)
	return params, names

def _set_nested_attr(obj: Module, names: List[str], value: Tensor) -> None:
	"""
	Set the attribute specified by the given list of names to value.
	For example, to set the attribute obj.conv.weight,
	use _set_nested_attr(obj, ['conv', 'weight'], value)
	"""
	if len(names) == 1:
		setattr(obj, names[0], value)
	else:
		_set_nested_attr(getattr(obj, names[0]), names[1:], value)

def load_weights(mod: Module, names: List[str], params: Tuple[Tensor, ...]) -> None:
	"""
	Reload a set of weights so that `mod` can be used again to perform a forward pass.
	Note that the `params` are regular Tensors (that can have history) and so are left
	as Tensors. This means that mod.parameters() will still be empty after this call.
	"""
	for name, p in zip(names, params):
		_set_nested_attr(mod, name.split("."), p)

def compute_jacobian(model, x):
	'''

	@param model: model with vector output (not scalar output!) whose parameters we want to compute the Jacobian for
	@param x: input, since computing any gradient requires some input
	@return: either store jac directly in parameters or store them differently

	we'll be working on a copy of the model because we don't want to interfere with the optimizers and other functionality
	'''

	jac_model = copy.deepcopy(model) # because we're messing around with parameters (deleting, reinstating etc)
	all_params, all_names = extract_weights(jac_model) # "deparameterize weights"
	load_weights(jac_model, all_names, all_params) # reinstate all weights as plain tensors

	def param_as_input_func(model, x, param):
		load_weights(model, [name], [param]) # name is from the outer scope
		out = model(x)
		return out

	for i, (name, param) in enumerate(zip(all_names, all_params)):
		jac = torch.autograd.functional.jacobian(lambda param: param_as_input_func(jac_model, x, param), param,
							 strict=True if i==0 else False, vectorize=False if i==0 else True)
		print(jac.shape)

	del jac_model # cleaning up

def compute_diag_Hessian(model, loss):
	'''
	Computing the diagonal Hessian layer wise and batches the computations over the layers
	@param model: model as a container for all the parameters
	@param loss: need to differentiate it
	@return:
	'''

	# run backward only if the gradients have not been populated yet
	if next(model.parameters()).grad is None:
		loss.backward(create_graph=True, retain_graph=True)

	grads = [param.grad for param in model.parameters()]

	# iterate over the precomputed gradients and the parameters in lockstep
	for grad, param in zip(grads, model.parameters()):
		gradgrad = torch.autograd.grad(outputs=grad, inputs=param, retain_graph=True, grad_outputs=torch.ones_like(grad), allow_unused=True)
		param.gradgrad = gradgrad  # store it conveniently on the parameter


class Net(torch.nn.Module):

	def __init__(self):
		super().__init__()

		self.nn = Sequential(Linear(in_features=4, out_features=7, bias=True),
					 torch.nn.Tanh(),
					 Linear(in_features=7, out_features=3, bias=True),)

	def forward(self, x):
		return self.nn(x)

net = Net()
x = torch.randn(5,4).requires_grad_()
compute_jacobian(net, x)

out = net(x).sum()
compute_diag_Hessian(net, out)


Hey guys!

All of the code above is helpful. Could anyone let me know how I can find the Jacobian with respect to each token for the parameter “model.gpt_neox.mlp.dense_4h_to_h.weights” fed into the large language model “Pythia”? For example, if I simply take the gradient of this parameter, I get the shape (512, 2048). However, when I do it for every token, the shape would be (seq_length, 512, 2048).

This is the code I wrote, but the Jacobian it returns is all zeros.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import torch.nn as nn

class Pythia70MModel(nn.Module):
    def __init__(self, model, model_name='EleutherAI/pythia-14m'):
        super(Pythia70MModel, self).__init__()
        self.model = model
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.tokenizer.pad_token = self.tokenizer.eos_token
        input_text = 'One day, a little girl named Lily found a needle in her room...'
        self.inputs = self.tokenizer(input_text, return_tensors="pt", padding='max_length', max_length=128, truncation=True)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, model_params):
        # Update weights directly
        # Create a new Parameter for forward computation only
        # temp_weights = model_params.clone()  # Create a clone to work with
        with torch.no_grad():
            self.model.gpt_neox.layers[0].mlp.dense_4h_to_h.weight.copy_(model_params)

        # Compute the forward pass
        outputs = self.model(**self.inputs)
        logits = outputs.logits

        # Return the loss for the first token
        # return self.loss_fn(logits[:, 0, :], self.inputs["input_ids"][:, 0])
        return logits[:,0,:]

# Initialize model and parameters
model = AutoModelForCausalLM.from_pretrained('EleutherAI/pythia-14m')
model_params = model.gpt_neox.layers[0].mlp.dense_4h_to_h.weight  # Reference to the actual weights
model_params.requires_grad_(True)  # Ensure gradients are tracked

# Instantiate the model
pythia_model = Pythia70MModel(model)

# Function to compute loss
def forward_func(model_params):
    return pythia_model.forward(model_params)

# Run a forward pass (returns the logits for the first token)
loss = forward_func(model_params)

jacobian = torch.autograd.functional.jacobian(forward_func, model_params)
print(jacobian)
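For what it's worth, here is a possible direction (my own untested sketch, not from the thread). Inside forward, the copy under torch.no_grad() means the output is not connected, as far as autograd is concerned, to the tensor that jacobian() differentiates with respect to, which would explain the all-zero result. torch.func.functional_call can instead substitute the tensor for the named parameter during the forward pass; the parameter name below is taken from the code above.

import torch
from torch.func import functional_call

param_name = "gpt_neox.layers.0.mlp.dense_4h_to_h.weight"
weight = dict(model.named_parameters())[param_name]

def first_token_logits(w):
    # forward pass with `w` substituted for the chosen parameter
    outputs = functional_call(model, {param_name: w}, args=(), kwargs=dict(pythia_model.inputs))
    return outputs.logits[:, 0, :]

jac = torch.autograd.functional.jacobian(first_token_logits, weight)
print(jac.shape)  # first-token logits shape + the weight's shape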