Find the derivative of a model's parameters w.r.t. a vector

Hello,
I am trying to find a double derivative using the torch.autograd.grad function. It requires a step where I have to find the double derivative of the model's parameters w.r.t. a vector (in this case A). Can someone please guide me on how to do so?

# reproduce error
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from transformers import BertModel, BertForMaskedLM, BertConfig, EncoderDecoderModel

model1 = EncoderDecoderModel.from_encoder_decoder_pretrained('bert-base-uncased', 'bert-base-uncased') # initialize Bert2Bert from pre-trained checkpoints
model2 = EncoderDecoderModel.from_encoder_decoder_pretrained('bert-base-uncased', 'bert-base-uncased') # initialize Bert2Bert from pre-trained checkpoints

A=torch.rand(1, requires_grad=True)
optimizer1 = torch.optim.Adam(model1.parameters(), lr=0.0001)

en_input=torch.tensor([[1,2], [3,4]])
en_masks=torch.tensor([[0,0], [0,0]])
de_output=torch.tensor([[3,1], [4,2]])
de_masks=torch.tensor([[0,0], [0,0]])
lm_labels=torch.tensor([[5,7], [6,8]])

torch.autograd.set_detect_anomaly(True)

def train1():
  acc=torch.zeros(1)
  for i in range(2):
    optimizer1.zero_grad()
    out = model1(input_ids=en_input, attention_mask=en_masks, decoder_input_ids=de_output, 
                        decoder_attention_mask=de_masks, labels=lm_labels.clone())
          

    prediction_scores = out[1]
    predictions = F.log_softmax(prediction_scores, dim=2)
    p=((predictions.sum() - de_output.sum())*A).sum()
    p=torch.unsqueeze(p, dim=0)
    acc = torch.cat((p,acc))

  loss=acc.sum()
  loss.backward(inputs=list(model1.parameters()), retain_graph=True,  create_graph=True) #calculates gradients 
  # I want to do something like this: first find the derivative w.r.t. the model's weights, then w.r.t. A, so essentially a double derivative.
  delL_delWo = torch.autograd.grad(loss, model1.parameters(), create_graph=True, allow_unused=True) # model1's weights -> Wo
  del2_Loss_delWo_delA = torch.autograd.grad(delL_delWo, A, allow_unused=True) # gradients w.r.t. A, i.e. del^2 Loss / (delWo delA)
  optimizer1.step() # weight update
 
  return del2_Loss_delWo_delA
train1()

Since grad can be implicitly created only for scalar outputs, I am confused as to how to find the second derivative, i.e. del2_Loss_delWo_delA, as delL_delWo will be a tuple of 2-D tensors (matching model1's parameters).

The second derivative d^2L / (dWo dA) is really a tuple of Jacobian matrices, where the i-th entry of that tuple is a Jacobian matrix with height p_i.numel() (where p_i is the i-th parameter) and width A.numel().

What you are actually computing when you call .grad() is the vjp, i.e. v^T J, where v is the quantity (a tuple of tensors) you pass in as grad_outputs to .grad(), and J can be thought of as all the Jacobians described above concatenated. Even though you may need to pass in 2-D tensors, the Jacobian really doesn't care about how they are organized into the shape of a tensor - all that matters is that the number of elements in the tensor becomes the size of a row/column of the Jacobian.

So there is really no correct or wrong v to pass to .grad(); whatever you pass as v just determines the coefficients in the linear combination of the rows of your Jacobian. Either way, whatever you pass as grad_outputs, A's grad will always be a vector with the same size as A, as the sketch below illustrates.
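
A minimal sketch of this (toy tensors standing in for the model's parameters): grad_outputs acts as the v in v^T J, and the result always has A's shape.

import torch

A = torch.rand(3, requires_grad=True)
W = torch.rand(4, requires_grad=True)

out = W.sum() * A          # non-scalar output: J w.r.t. A is 3x3
v = torch.ones_like(out)   # v: coefficients of the linear combination of J's rows

# vjp: the result has the same shape as A, regardless of out's shape
(vjp,) = torch.autograd.grad(out, A, grad_outputs=v)
print(vjp.shape)           # torch.Size([3])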

If you really need the entire Jacobian, and not a linear combination of its rows, consider using:
https://pytorch.org/docs/stable/autograd.html?highlight=jacobian#torch.autograd.functional.jacobian
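
For instance, a minimal sketch of that API with a toy function (not the model above):

import torch
from torch.autograd.functional import jacobian

A = torch.rand(3)

def f(a):
    return a * a  # elementwise, so the Jacobian is diagonal

J = jacobian(f, A)  # shape f(A).shape + A.shape = (3, 3), with 2*A on the diagonal
print(J)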

Otherwise, if you need a quantity matching the size of tensor A, and are indifferent to the individual dimensions of each of your parameters, you can just pass all-ones tensors matching the sizes of each of the tensors you pass as outputs.

Thanks a lot for your reply, @soulitzer. So, as stated by you, I should find the Jacobian of each of the model's parameters w.r.t. A, which will lead to a tuple of Jacobian matrices. Further on, I am actually required to update A using this calculated gradient. The gradient matrix should be of size equal to A's shape. How am I supposed to obtain the gradient matrix of the required dimensions from the tuple of Jacobians?
Also, on a side note, could you please explain the difference between the 'entire Jacobian' and a 'linear combination of its rows'? Thank you.
I am using this code to calculate the Jacobian.

No problem! Since you need to update A later using this computed gradient, it actually seems like the vjp IS what you want here. In that case, I wouldn't worry about the "entire Jacobian" too much, and no further action is needed apart from just using .grad(delL_delWo, A) with a grad_outputs such as tuple(torch.ones_like(dL_dp) for dL_dp in delL_delWo). By using a grad_outputs of ones_like(…) you are essentially computing "how sensitive A is to changes to the dL_dWo in the (1, 1, 1, …, 1) direction", and this value can be used to update A, because it's the same size as A.
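
Put together, a minimal sketch of that suggestion (a toy nn.Linear standing in for model1):

import torch
import torch.nn as nn

model = nn.Linear(4, 2)
A = torch.rand(3, requires_grad=True)
x = torch.rand(4)

loss = (model(x).sum() * A).sum()

# First derivative w.r.t. the weights; create_graph=True keeps it differentiable
delL_delWo = torch.autograd.grad(loss, model.parameters(), create_graph=True)

# Second derivative w.r.t. A, using all-ones grad_outputs (a vjp)
grad_out = tuple(torch.ones_like(g) for g in delL_delWo)
(del2_L_delWo_delA,) = torch.autograd.grad(delL_delWo, A, grad_outputs=grad_out)
print(del2_L_delWo_delA.shape)  # torch.Size([3]) -- same size as A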

If the entire Jacobian is J, the linear combination of its rows is v^T J for some arbitrary v, and so the values of v determine the coefficients of the linear combination.

Thanks again, @soulitzer! I just had one doubt. I'll be obtaining the derivatives of each of the model's parameters w.r.t. A. So, to update A, can I sum the obtained derivatives to obtain one single matrix of the same size as A, so that I can use the calculated gradients to update A?

I believe sum is automatic as a result of how gradients are accumulated during backward. You won’t need to do any summation yourself.

A simple example could be:

import torch

a = torch.tensor(2., requires_grad=True)
b = a * a   # db/da = 2a = 4
c = a + 1   # dc/da = 1

grad_out = torch.ones_like(a)

db_da = torch.autograd.grad(b, a, grad_out, retain_graph=True)
dc_da = torch.autograd.grad(c, a, grad_out, retain_graph=True)
# Passing both outputs at once accumulates (sums) the two gradients
dbc_da = torch.autograd.grad((b, c), a, (grad_out, grad_out))

print(f"{dbc_da} = {db_da} + {dc_da}")  # (5.,) = (4.,) + (1.,)

Thank you so much, that helped a lot! @soulitzer :slight_smile: I had another doubt: my grads for A will be the same size as A, so if my A is of size 10 then the grads will be of size 1x10. I have another set of grads that I calculated w.r.t. the model's parameters, which is hence a tuple of length 512 containing 2-D tensors (e.g. ((tensor of size 10x10), (tensor of size 10), …) → dummy sizes). Now my task is to multiply the grads, i.e. A.grads x model.grads, and use the result to update A. Any idea how to do so?

You're welcome! For this one, I probably don't understand enough about your use case. On the surface, it doesn't make a lot of sense to update A with model.grads. Semantically, what is the end result supposed to look like?

For example, if you do .grad(delL_delWo, A, grad_outputs=tuple(torch.ones_like(dL_dp) for dL_dp in delL_delWo)), you are computing the direction to perturb A which will cause each of the parameter grads delL_delWo to change in the (1, 1, ..., 1) direction.

Not sure what multiplying would accomplish though…

Sorry about that, @soulitzer. This is my use case; hope you can shed some light.
I am implementing a pipeline consisting of 2 models, M1 and M2. The pipeline is divided into 3 steps.

1. Train M1 on a dataset D1_train
a. y=model1(D1_train)
b. L1=(y-actual)*A
c. L1.backward()
d. delL1_delWo=torch.autograd.grad(L1, model1.parameters(), create_graph=True, allow_unused=True) # model1's weights -> Wo
e. params=[]
   out_params=[]
   for i in range(len(delL1_delWo)):
     if delL1_delWo[i] is not None:
       params.append(delL1_delWo[i])
       out_params.append(torch.ones_like(delL1_delWo[i]))
   del2_L1_delWo_delA=torch.autograd.grad(params, A, allow_unused=True, grad_outputs=out_params) # gradients w.r.t. A, i.e. del^2 L1 / (delWo delA)
f. optimizer1.step()   # to update model1's weights

2. Train M2 on a dataset D2 generated by the trained M1
a. D2=model1(inp)
b. y=model2(D2)
c. L2=y-actual
d. L2.backward()
e. delL2_delW=torch.autograd.grad(L2, model2.parameters(), create_graph=True, allow_unused=True, only_inputs=True) # model2's weights -> W
f. params=[]
   out_params=[]
   del2_L2_delW_delWo=[]
   for i in range(len(delL2_delW)):
     if delL2_delW[i] is not None:
       params.append(delL2_delW[i])
       out_params.append(torch.ones_like(delL2_delW[i]))
   for param1 in model1.parameters():
     grad=torch.autograd.grad(params, param1, allow_unused=True, grad_outputs=out_params, retain_graph=True)[0]
     del2_L2_delW_delWo.append(grad)
g. optimizer2.step() # to update model2's weights

3. Train A by reducing the validation loss of model2
a. y=model2(D1_val)
b. L3=y-actual
c. delL3_delW=torch.autograd.grad(L3, model2.parameters())
// Want to update A using the following gradient:
// delL3_delA = (delL3_delW) x (del2_L2_delW_delWo) x (del2_L1_delWo_delA)
d. A.grad = delL3_delW x del2_L2_delW_delWo x del2_L1_delWo_delA -> how to calculate this? (essentially the chain rule: delWo/delA x delW/delWo x delL3/delW)
e. optimizer3.step() # to update A

So I was able to calculate del2_L2_delW_delWo, del2_L1_delWo_delA, and delL3_delW, but the shapes of all three are different:
del2_L2_delW_delWo and delL3_delW → length-512 tuples containing tensors of variable dimensions (e.g. ((tensor of size 10x10), (tensor of size 10), …) → dummy sizes)
del2_L1_delWo_delA → tensor of size (1x5)

Update:

I tried taking the dot product of the tensors in del2_L2_delW_delWo and delL3_delW and summing the results to get a new gradient. Then I multiplied the obtained gradient by del2_L1_delWo_delA. Is this right? (ref: neural network - Pytorch, what are the gradient arguments - Stack Overflow, second answer.) It was mentioned in the answer that, according to the chain rule, in order to calculate the gradient of the loss w.r.t. a leaf node, we can compute the derivative of the loss w.r.t. some intermediate variable and the gradient of the intermediate variable w.r.t. the leaf variable, take the dot product, and sum these up. I used this approach as I wanted to implement the chain rule manually, as sketched below.
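
A minimal sketch of that manual chain rule (toy tensors, not the pipeline above), checked against autograd's direct result:

import torch

x = torch.rand(3, requires_grad=True)
y = x * 2           # intermediate variable
L = (y * y).sum()

dL_dy = torch.autograd.grad(L, y, retain_graph=True)[0]  # dL/dy = 2y
dy_dx = torch.full_like(x, 2.0)                          # dy_i/dx_i = 2 (elementwise)

manual = dL_dy * dy_dx                 # chain rule through the intermediate variable
direct = torch.autograd.grad(L, x)[0]
print(torch.allclose(manual, direct))  # True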

Thank you for sharing your use case!

Why do you want to implement the chain rule manually? Maybe it's better to compute dL3/dA directly? For example, A_grad = torch.autograd.grad(L3, A).

This is possible if you just do the 1f and 2g update steps manually (i.e. differentiably), as in the sketch below.
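
A minimal sketch of what such a differentiable manual update looks like (a single tensor w standing in for a model's parameters), so that the gradient w.r.t. A flows through the update:

import torch

A = torch.rand(2, requires_grad=True)
w = torch.rand(3, requires_grad=True)  # stand-in for a model parameter
x = torch.rand(3)

L1 = (w @ x) * A.sum()
(g,) = torch.autograd.grad(L1, w, create_graph=True)

# Differentiable update: w_new depends on A through g (no torch.no_grad() here)
w_new = w - 0.001 * g

L3 = (w_new @ x).sum()
print(torch.autograd.grad(L3, A))  # gradient flows back through the update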

Thanks a lot for your suggestion, @soulitzer! I tried it as you suggested, using the optimizer implementation in Updatation of Parameters without using optimizer.step() - #4 by albanD. But I get the following error:

RuntimeError: One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior.

It seems that A is not a part of the computation graph, and hence torch.autograd.grad is not able to calculate its grads. Is this happening because I am using 'with torch.no_grad()' while implementing my update step?
Here is a working piece of code:

# reproducible code
import torch
import torch.nn.functional as F
from transformers import BertModel, BertForMaskedLM, BertConfig, EncoderDecoderModel
model1 = EncoderDecoderModel.from_encoder_decoder_pretrained('bert-base-uncased', 'bert-base-uncased') # initialize Bert2Bert from pre-trained checkpoints
model2 = EncoderDecoderModel.from_encoder_decoder_pretrained('bert-base-uncased', 'bert-base-uncased') # initialize Bert2Bert from pre-trained checkpoints

A=torch.rand(2, requires_grad=True)
optimizer1 = torch.optim.Adam(model1.parameters(), lr=0.0001)
optimizer2 = torch.optim.Adam(model2.parameters(), lr=0.0001)
optimizer3 = torch.optim.SGD([A], lr=0.001)

en_input=torch.tensor([[1,2], [3,4]])
en_masks=torch.tensor([[0,0], [0,0]])
de_output=torch.tensor([[3,1], [4,2]])
de_masks=torch.tensor([[0,0], [0,0]])
lm_labels=torch.tensor([[5,7], [6,8]])

torch.autograd.set_detect_anomaly(True)

def update_function(param, grad, loss, learning_rate):
  return param - learning_rate * grad

def train1():
  for i in range(2):
    #optimizer1.zero_grad()
    out = model1(input_ids=en_input, attention_mask=en_masks, decoder_input_ids=de_output, 
                        decoder_attention_mask=de_masks, labels=lm_labels.clone())
        
    prediction_scores = out[1]
    predictions = F.log_softmax(prediction_scores, dim=2)
    loss1=((predictions.sum() - de_output.sum())*A).sum()
  
    loss1.backward(inputs=list(model1.parameters()), retain_graph=True,  create_graph=True) 
    #optimizer1.step()
    #updating weights
    with torch.no_grad():
      for p in model1.parameters():
        if p.grad is not None:
          new_val = update_function(p, p.grad, loss1, 0.001)
          p.copy_(new_val)

def train2():
  for i in range (2):
    #optimizer2.zero_grad()
    outputs=model1(input_ids=en_input, decoder_input_ids=en_input, output_hidden_states=True, return_dict=True)
    predictions = F.log_softmax(outputs.logits, dim=2)
    values, new_labels = torch.max(predictions, 2)

    output=outputs.decoder_hidden_states[-1]
    out=model2(input_ids=en_input, decoder_inputs_embeds=output, labels=new_labels)
    prediction_scores = out[1]
    predictions = F.log_softmax(prediction_scores, dim=2)
    loss2=((predictions.sum() - new_labels.sum())).sum()
    
    loss2.backward(retain_graph=True,  create_graph=True)
    #optimizer2.step() 
    with torch.no_grad():
      for p in model2.parameters():
        if p.grad is not None:
          new_val = update_function(p, p.grad, loss2, 0.001)
          p.copy_(new_val)
       
def train3():
  optimizer3.zero_grad()
  output = model2(input_ids=en_input, attention_mask=en_masks, decoder_input_ids=de_output, 
                      decoder_attention_mask=de_masks, labels=lm_labels.clone())
        
  prediction_scores_ = output[1]
  predictions_= F.log_softmax(prediction_scores_, dim=2)
  loss3=((predictions_.sum() - de_output.sum())).sum()
  A.retain_grad()
  A.grad=torch.autograd.grad(loss3, A)[0] # --> error (note: grad() returns a tuple, hence the [0])
  optimizer3.step() # to update A

train1()
train2()
train3()

Exactly. You will need to perform the update step without no-grad mode for A to be part of L3's graph, but this is a bit tricky since in-place updates (in grad mode) aren't allowed on parameters. You might want to use a library like higher, which handles this type of thing for you: higher documentation — higher 0.2.1 documentation

import torch
import torch.nn as nn
import higher

model = nn.Linear(10, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

x = torch.rand(10)
y = torch.rand(10, requires_grad=True)

with higher.innerloop_ctx(model, optimizer, copy_initial_weights=True) as (fmodel, diffopt):
    for i in range(2):
        loss = (fmodel(x) * y).sum()
        diffopt.step(loss)  # differentiable update: fmodel's new params depend on y

        out = fmodel(x).sum()
        print(torch.autograd.grad(out, y))  # gradient flows back through the updates

Some other things to note:

  • you don't need to do A.retain_grad() since A is already a leaf tensor
  • instead of doing A.grad = torch.autograd.grad(…, you can do loss.backward(inputs=A)
  • if you want the gradients to flow all the way back to A, you need to be careful about performing ops that autograd isn't able to differentiate through, e.g., _, idx = torch.max(...).

Thanks a lot for your suggestions! I finally used a finite difference approximation to calculate the chain rule part, along the lines of the sketch below. Thanks for your help again!
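
For reference, a minimal sketch of a central finite difference approximation of a gradient w.r.t. A (the function f here is purely illustrative, not the pipeline's loss):

import torch

def f(a):
    return (a * a).sum()  # illustrative stand-in for the loss as a function of A

A = torch.rand(2)
eps = 1e-4
grad_approx = torch.zeros_like(A)
for i in range(A.numel()):
    e = torch.zeros_like(A)
    e[i] = eps
    # central difference: df/dA_i ~= (f(A + eps*e_i) - f(A - eps*e_i)) / (2*eps)
    grad_approx[i] = (f(A + e) - f(A - e)) / (2 * eps)

print(grad_approx, 2 * A)  # approximation vs. the exact gradient 2A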
