Model not training, gradients are None

Hey everyone!
I’m currently finetuning a pretrained sentence transformer with in-domain data.

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: XLMRobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

Below you can find a snippet of the code:


from sentence_transformers import SentenceTransformer
import torch
activation = {}

def hook(name, output):
    activation[name] = output[0].detach()

model = SentenceTransformer('T-Systems-onsite/cross-en-de-roberta-sentence-transformer')
model.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
cos_sim = torch.nn.CosineSimilarity(dim=0)
optimizer.zero_grad()

query_prediction = model.encode("a man is cutting up a tomato", convert_to_tensor=True)
positive_prediction = model.encode("a man is slicing a tomato", convert_to_tensor=True)
negative_prediction = model.encode("she's brushing her hair", convert_to_tensor=True)

dist_pos = 1 - cos_sim(query_prediction, positive_prediction)
dist_neg = 1 - cos_sim(query_prediction, negative_prediction)

loss = torch.max(torch.tensor(0.), 0.7 + dist_pos - dist_neg)

if loss != torch.tensor(0.):
    loss.backward()
for p in model.parameters():
    print(p.grad)

for name, layer in model.named_modules():
    layer.register_forward_hook(hook(name,query_prediction))
print(activation)

The loss is being calculated, but the gradients are None. Therefore, the model is not training.

When going through the sentence transformer code, within the encode method the forward seems to be calculated with no grad. Might that be the problem?
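If encode does wrap its forward pass in torch.no_grad(), as suspected above, the effect can be illustrated on a toy layer. This is a minimal sketch in plain PyTorch, not the SentenceTransformer code itself; the Linear layer and the random input are stand-ins chosen only for illustration.

```python
import torch

# Toy stand-in for the encoder; NOT the SentenceTransformer model.
torch.manual_seed(0)
layer = torch.nn.Linear(4, 2)
x = torch.randn(1, 4)

# Forward pass inside no_grad: autograd records nothing.
with torch.no_grad():
    out_no_grad = layer(x)

# Ordinary forward pass: the output is attached to the graph.
out_grad = layer(x)

print(out_no_grad.requires_grad, out_no_grad.grad_fn)        # False None
print(out_grad.requires_grad, out_grad.grad_fn is not None)  # True True

# Only the tracked output can propagate gradients back to the weights.
out_grad.sum().backward()
weight_grad_is_set = layer.weight.grad is not None
print(weight_grad_is_set)  # True
```

A loss built only from tensors like out_no_grad has no path back to the parameters, so their .grad would stay None.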

Any tips or ideas on why the gradients are None would be much appreciated.

Did you verify that loss.backward() was called?
If so, do you see valid .grad_fn attributes for loss, dist_pos, dist_neg, and the *_prediction tensors (query_prediction, positive_prediction, negative_prediction)?

Thank you for replying.
The loss.backward() is in fact called when the batch loss is not 0.
The output of .grad_fn attributes is:

>  loss <MaximumBackward object at 0x7f20ea3fb3d0>
>  query_prediction <SelectBackward object at 0x7f20ea3fb3d0>
>  positive_prediction <SelectBackward object at 0x7f20ea3fb3d0>
>  negative_prediction <SelectBackward object at 0x7f20ea3fb3d0>
>  dist_pos <RsubBackward1 object at 0x7f20ea3fb3d0>
>  dist_neg <RsubBackward1 object at 0x7f20ea3fb3d0>

This means that at least the model output is attached to the computation graph, so you could check the grad_fn attributes of the earlier activations and see whether any of them is None.

Thanks for the explanation!

How can I do that?

When checking the parameters like so:

for p in model.parameters():
    print(p.grad_fn)

all I get is None.

The parameters don’t have any grad_fn, as they are leaf nodes, so you would need to check the forward activations either directly in the forward method, e.g. via:

def forward(self, x):
    x = self.layer(x)
    print(x.grad_fn)
    ...

or via forward hooks.
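The forward-hook variant could look roughly like the sketch below. It assumes a toy Sequential model whose modules return plain tensors; record_grad_fn and grad_fns are hypothetical names, and modules that return tuples or dicts would need their outputs unpacked first.

```python
import torch
from functools import partial

# Toy model standing in for the real network (hypothetical layers).
model = torch.nn.Sequential(
    torch.nn.Linear(4, 8),
    torch.nn.ReLU(),
    torch.nn.Linear(8, 2),
)

grad_fns = {}  # module name -> grad_fn of that module's output

def record_grad_fn(name, module, inputs, output):
    # Forward hooks are called as hook(module, inputs, output);
    # `name` is bound in advance with functools.partial.
    grad_fns[name] = output.grad_fn

handles = [
    layer.register_forward_hook(partial(record_grad_fn, name))
    for name, layer in model.named_modules()
    if name  # skip the root module, whose name is ''
]

out = model(torch.randn(3, 4))  # hooks fire during this forward pass

for name, fn in grad_fns.items():
    print(name, fn)  # a None here marks where the graph is cut

for h in handles:
    h.remove()  # detach the hooks once the check is done
```

The first module whose output reports None is where the activations stop being tracked by autograd.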

Maybe something similar?


def hook(name, output):
    activation[name] = output[0].detach()

query_prediction = model.encode("a man is cutting up a tomato", convert_to_tensor=True)



for name, layer in model.named_modules():
    layer.register_forward_hook(hook(name,query_prediction))
print(activation)

The output tensors all have the same value.

{'': tensor(-0.0094),
'0': tensor(-0.0094),
'0.auto_model': tensor(-0.0094),
'0.auto_model.embeddings': tensor(-0.0094),
'0.auto_model.embeddings.word_embeddings': tensor(-0.0094),
.
.
.
'1': tensor(-0.0094)}
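The identical values appear to come from the registration loop itself: hook(name, query_prediction) is executed immediately, so every call stores query_prediction[0] (the first element of the same embedding), and register_forward_hook receives the function's return value, None, instead of a callable. A minimal sketch of the usual pattern follows, on a toy model rather than the SentenceTransformer; make_hook is a hypothetical helper name.

```python
import torch

# Toy model; the real modules may return dicts or tuples instead of tensors.
model = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.Linear(8, 2))
activation = {}

def make_hook(name):
    # Return the actual hook; PyTorch calls it as hook(module, inputs, output).
    def hook(module, inputs, output):
        activation[name] = output.detach()
    return hook

# Register first (passing the callable, not calling it) ...
for name, layer in model.named_modules():
    layer.register_forward_hook(make_hook(name))

# ... then run a forward pass so the hooks fire.
model(torch.randn(1, 4))

for name, act in activation.items():
    print(repr(name), tuple(act.shape))
```

With the hooks registered before the forward pass, each entry holds that module's own output rather than a copy of the final embedding.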