Fine-tuning two SentenceTransformers (twin-BERT-like)

Hello all,

This is similar to the issue "RuntimeError: element 0 of variables does not require grad and does not have a grad_fn".

I am training two sentence transformer models jointly (learning a cosine similarity between their two embeddings) using the code below.

My issue: the loss does not change. I suspect that the gradients do not propagate, since the .grad of the parameters held by the optimizer is None. Something must be frozen here, but I really don’t see what.

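from io import BytesIO

import requests
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from PIL import Image
from sentence_transformers import SentenceTransformer

class CosineLoss(nn.Module):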
    def __init__(self):
        super(CosineLoss, self).__init__()
        self.loss = nn.MSELoss(reduction='sum')

    def forward(self, output1, output2, label):
        cos_sim = F.cosine_similarity(output1, output2)
        final_loss = self.loss(cos_sim, label)
        return final_loss
      
class TestModel(nn.Module):
  def __init__(self):
    super().__init__()
    self.modelA = SentenceTransformer('clip-ViT-B-32')
    self.modelB = SentenceTransformer('clip-ViT-B-32-multilingual-v1')

  def embed_A(self, inputA):
    x = self.modelA.encode([Image.open(BytesIO(requests.get(filepath).content)).convert('RGB')for filepath in inputA], convert_to_tensor=True)
    return x
  
  def embed_B(self, inputB):
    x = self.modelB.encode(inputB, convert_to_tensor=True)
    return x

  def forward(self, inputA, inputB):
    x1 = self.embed_A(inputA)
    x2 = self.embed_B(inputB)
    return x1, x2


mymodel = TestModel()
criterion = CosineLoss()
optimizer = optim.Adam(mymodel.parameters(), lr=lr,betas=betas,eps=eps,weight_decay=wd) 

for param in mymodel.parameters():
    param.requires_grad = True

for epoch in range(maxepochs):
  print('Epoch:', epoch)
  mymodel.train()
  for batch in dm.train_dataloader():
    mymodel.zero_grad()
 
    list_A,list_B,ground_truth = batch 
   
    outputA, outputB = mymodel(list_A,list_B)
    ground_truth = ground_truth.to(device)

    optimizer.zero_grad()
    loss = criterion(outputA,outputB,ground_truth)

    loss.requires_grad = True
    loss.backward()
    optimizer.step()
    print(optimizer.param_groups[0]['params'][0].grad)

  print("Saving model for epoch:", epoch)
  print("Total Loss for Epoch number {} is {}".format(epoch, loss))

That code prints:

Epoch: 0
None
None
None
None
None
Saving model for epoch: 0
Total Loss for Epoch number 0 is 0.4687237625608282
Epoch: 1
None
None
None
None
None
Saving model for epoch: 1
Total Loss for Epoch number 1 is 0.4687237625608282
Epoch: 2
None
None
None
None
None
Saving model for epoch: 2
Total Loss for Epoch number 2 is 0.4687237625608282

I believe that the line loss.requires_grad = True is the issue, but removing it just brings back the error message "element 0 of tensors does not require grad and does not have a grad_fn".

I would love to get your help on this.

Thank you very much.
Belhal

Update: I was able to get rid of the line loss.requires_grad = True by adding .requires_grad_() to my input embeddings, as in:

  def embed_A(self, inputA):
    x = self.modelA.encode([Image.open(BytesIO(requests.get(filepath).content)).convert('RGB') for filepath in inputA], convert_to_tensor=True).requires_grad_()
    return x

  def embed_B(self, inputB):
    x = self.modelB.encode(inputB, convert_to_tensor=True).requires_grad_()
    return x

Now print(loss.grad_fn) returns an object, yet print(optimizer.param_groups[0]['params'][1].grad) returns None and the loss still stays constant.
It seems that while my loss is now differentiable, the optimizer does not treat mymodel.parameters() as variables to update.

Any ideas on how to fix it?
Thank you

This also doesn’t sound like a proper solution. Calling .requires_grad_() on the output of self.modelB.encode will not stitch the computation graphs together, but will start a new one.
The operations in self.modelB (and before that) are still detached and the parameters used in these operations will not get a valid gradient (using the current output).

If the return value of self.modelB.encode is already detached (check its .grad_fn attribute and see if it’s None), then you would need to dig into the model definition and see if gradient calculation is disabled or which operation detaches the graph.
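
The effect described above can be reproduced with plain PyTorch. This is a minimal sketch, not the SentenceTransformer code itself: `encoder` is a hypothetical stand-in for a pretrained model, and the `torch.no_grad()` block only mimics an output that arrives already detached.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
encoder = nn.Linear(4, 2)  # hypothetical stand-in for a pretrained encoder
inputs = torch.randn(3, 4)

# Mimic a call that runs with gradient tracking disabled.
with torch.no_grad():
    detached = encoder(inputs)

print(detached.requires_grad, detached.grad_fn)  # False None

# requires_grad_() turns the tensor into a new leaf.
# It does not reconnect the tensor to the encoder's parameters.
emb = detached.requires_grad_()
loss = emb.pow(2).sum()
loss.backward()

print(loss.grad_fn is not None)  # True: the loss is differentiable w.r.t. emb
print(emb.grad is not None)      # True: the gradient stops at the new leaf
print(encoder.weight.grad)       # None: nothing reaches the encoder
```

The loss has a grad_fn and backward() runs without error, yet the encoder parameters never receive a gradient, which matches a constant loss after optimizer.step().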

Thank you for the clarification.
You are totally right.
Basically, these two simple lines print None:

modelB = SentenceTransformer('clip-ViT-B-32-multilingual-v1')
out = modelB.encode('a sentence', convert_to_tensor=True)
print(out.grad_fn)
>>> None

So indeed those output embeddings (from modelA and modelB) are detached and no grad can be computed.

I wonder whether this is simply the default behavior of the Hugging Face SentenceTransformer model (not fine-tuning friendly by default, which would be understandable).

Any thoughts?

Thanks again, you were spot on!

I’m not sure which repository you are using, but I assume it’s UKPLab/sentence-transformers. If so, then note that the .encode method explicitly disables gradient calculation, so it seems to be by design.
A related issue in that repository received this answer:

You can use the forward method of SentenceTransformers. This allows you to include SentenceTransformer in a larger model.
However, the forward method expects a correctly formatted batch as input.
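
The difference between the two entry points can be sketched with a toy module. `ToyEncoder` and its feature key `"x"` are hypothetical and only mimic the pattern of an inference-only `encode` sitting beside a differentiable `forward` that takes and returns a dict; the real library's internals may differ.

```python
import torch
import torch.nn as nn


class ToyEncoder(nn.Module):
    """Hypothetical encoder with a no-grad encode() and a trainable forward()."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 2)

    def forward(self, features):
        # Differentiable path: takes a dict of tensors and adds the embedding to it.
        features["sentence_embedding"] = self.proj(features["x"])
        return features

    @torch.no_grad()
    def encode(self, x):
        # Inference path: no graph is recorded while this method runs.
        return self.forward({"x": x})["sentence_embedding"]


toy = ToyEncoder()
x = torch.randn(3, 4)

via_encode = toy.encode(x)
via_forward = toy({"x": x})["sentence_embedding"]

print(via_encode.grad_fn)               # None
print(via_forward.grad_fn is not None)  # True
```

Only the tensor obtained through forward carries a grad_fn, so only that path can deliver gradients to the parameters.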

Yes, I am using this repo:

from sentence_transformers import SentenceTransformer

OK, I will dig into the batch formatting.
The GitHub ticket does not say much, unfortunately, and I would expect a simple input text such as a string (as in 'a sentence') to work.

Thank you for your help in narrowing it down!

For future reference, from Nils Reimers:

This should work:

encoded_input = model.tokenize(your_input)
out = model(encoded_input)

out will contain a dict with different features. The sentence_embedding feature is the final output of the model.

This is the solution for backpropagating through the SentenceTransformer API.
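
Applied to the twin setup from the original question, the pattern might look like the sketch below. To keep it runnable without downloading models, `ToyTextEncoder` is a hypothetical stand-in that only imitates the `tokenize` / `forward` / `sentence_embedding` interface quoted above. `TwinModel`, the sample sentences, and the labels are likewise illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyTextEncoder(nn.Module):
    """Hypothetical stand-in exposing tokenize() and a dict-returning forward()."""

    def __init__(self, dim=8, max_len=16):
        super().__init__()
        self.max_len = max_len
        self.emb = nn.Embedding(256, dim)

    def tokenize(self, texts):
        # Map characters to ids and pad every text to max_len with zeros.
        ids = [[ord(c) % 256 for c in t[: self.max_len]] for t in texts]
        ids = [row + [0] * (self.max_len - len(row)) for row in ids]
        return {"input_ids": torch.tensor(ids)}

    def forward(self, features):
        # Mean-pool the token embeddings into one vector per text.
        features["sentence_embedding"] = self.emb(features["input_ids"]).mean(dim=1)
        return features


class TwinModel(nn.Module):
    def __init__(self, model_a, model_b):
        super().__init__()
        self.model_a = model_a
        self.model_b = model_b

    def forward(self, inputs_a, inputs_b):
        # tokenize() + forward() instead of encode(), so the graph reaches the parameters.
        out_a = self.model_a(self.model_a.tokenize(inputs_a))["sentence_embedding"]
        out_b = self.model_b(self.model_b.tokenize(inputs_b))["sentence_embedding"]
        return out_a, out_b


torch.manual_seed(0)
model = TwinModel(ToyTextEncoder(), ToyTextEncoder())
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
labels = torch.tensor([1.0, 0.0])  # illustrative similarity targets

losses = []
for _ in range(3):
    optimizer.zero_grad()
    out_a, out_b = model(["a cat", "a dog"], ["un chat", "une voiture"])
    loss = F.mse_loss(F.cosine_similarity(out_a, out_b), labels, reduction="sum")
    loss.backward()  # no loss.requires_grad = True needed
    optimizer.step()
    losses.append(loss.item())

has_grad = all(p.grad is not None for p in model.parameters())
print(has_grad, losses)
```

With real SentenceTransformer instances in place of the toy encoders, the tokenized tensors would probably also need to be moved to the model's device before the forward call.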