Variables are not updated after loss.backward() and optimizer.step()

Hello,

I’m trying to build a multi-label classifier based on BERT fine-tuning. However, the model parameters are not updated after loss.backward() and optimizer.step(), so my model never trains.

I already checked that all model parameters have requires_grad set to True.
I also followed the computation graph to make sure that all intermediate variables have requires_grad set to True as well.

For reference, I am using PyTorch 1.7.0.

Here are my model’s layers:

UNERLinearModel(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        ....
        (11): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )
  (size_embeddings): Embedding(300, 300)
  (lstm): BiLstmContextuelLayer(
    (layers): ModuleList(
      (0): BidirLSTMLayer(
        (directions): ModuleList(
          (0): LSTMLayer(
            (cell): CustomLSTMCell()
          )
          (1): ReverseLSTMLayer(
            (cell): CustomLSTMCell()
          )
        )
      )
      (1): BidirLSTMLayer(
        (directions): ModuleList(
          (0): LSTMLayer(
            (cell): CustomLSTMCell()
          )
          (1): ReverseLSTMLayer(
            (cell): CustomLSTMCell()
          )
        )
      )
      (2): BidirLSTMLayer(
        (directions): ModuleList(
          (0): LSTMLayer(
            (cell): CustomLSTMCell()
          )
          (1): ReverseLSTMLayer(
            (cell): CustomLSTMCell()
          )
        )
      )
    )
  )
  (entity_classifier): Linear(in_features=1, out_features=3, bias=True)
  (dropout): Dropout(p=0.1, inplace=False)
)

And here is my loss computation and training step:

loss_function = torch.nn.BCEWithLogitsLoss()

entity_logits = torch.tensor(entity_logits.float(), requires_grad=True)
entity_types = entity_types.view(-1).float()
entity_masks = entity_masks.view(-1).float()

a = self._model.lstm.layers[0].directions[0].cell.W.clone()

train_loss = loss_function(entity_logits, entity_types)
train_loss = (train_loss * entity_masks).sum() / entity_masks.sum()

# Debug only: print all of the model's parameters that should be updated
for name, param in self._model.named_parameters():
    if param.requires_grad:
        print(name, param)

# Backward pass: compute gradients for the parameters of all model layers
train_loss.backward(retain_graph=True)
torch.nn.utils.clip_grad_norm_(self._model.parameters(), self._max_grad_norm)

self._optimizer.step()
self._scheduler.step()
self._model.zero_grad()

b = self._model.lstm.layers[0].directions[0].cell.W.clone()
print(f"parameters update {not(torch.equal(a, b))}")

return train_loss.item()

How can I properly debug this issue?
How can I make sure the graph is not broken somewhere?
How can I make sure the backward pass is executed correctly?

Please help me with any suggestion; I have been stuck on this topic for 4 days.

Thank you very much.

Hi, can you try it without the clipping? Maybe it is clipped too tightly, and then you don’t lose your computational graph, but you do lose your mathematical gradient.
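
For reference, torch.nn.utils.clip_grad_norm_ returns the total gradient norm before clipping, so a quick print (a minimal sketch reusing the variables from your snippet) would show how tight the limit actually is:

# clip_grad_norm_ returns the total norm of the gradients *before* clipping,
# so printing it shows whether the limit is far below the actual norm.
total_norm = torch.nn.utils.clip_grad_norm_(self._model.parameters(), self._max_grad_norm)
print(f"grad norm before clipping: {float(total_norm):.4f} (limit: {self._max_grad_norm})")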

Unfortunately, I already tried that. I removed the line:

torch.nn.utils.clip_grad_norm_(self._model.parameters(), self._max_grad_norm)

But the result is always the same: no parameter is updated.

Does your self._optimizer try to update the correct parameters?


How can I check this?

However, the optimizer does contain params with requires_grad set to True. I checked it with the following code:

for param in self._optimizer.param_groups[0]['params']:
    if param.requires_grad:
        print(param.grad)

Note that the .grad of every param is None.

For information, I am using AdamW as the optimizer.
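
For reference, here is one way to confirm that the optimizer holds the very same parameter objects as the model (a minimal sketch using the objects above):

# The optimizer must reference the model's actual Parameter objects,
# not copies or detached tensors, so compare object identities.
model_param_ids = {id(p) for p in self._model.parameters()}
optim_param_ids = {id(p) for group in self._optimizer.param_groups
                   for p in group['params']}
print(optim_param_ids <= model_param_ids)  # should print True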

Actually, that looks fine.
Maybe your update works. Can you print a before your optimizer step line, print b after it, and check whether they are the same?

The two variables a and b are always equal.
Any other suggestions for debugging?

@CedricLy Thanks for your help.

Check the .grad attribute of all parameters that don’t seem to be updated, and see whether the gradients have valid values, are all zero, or are even None.
If they are set to None after the first backward and before the first zero_grad() call, your computation graph might be broken. On the other hand, if they are zero, some operations in the model might kill the gradient.
Make sure that you check the .grad attributes after the backward and before the zero_grad operation.
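
Something like this (a minimal sketch) run at that point would show which case applies:

# Run after loss.backward() and before zero_grad():
for name, param in self._model.named_parameters():
    if param.grad is None:
        print(f"{name}: grad is None -> graph likely broken upstream")
    elif param.grad.abs().sum() == 0:
        print(f"{name}: grad is all zero -> some op may kill the gradient")
    else:
        print(f"{name}: grad looks valid (norm {float(param.grad.norm()):.4e})")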


Finally, after 5 days, I found the error.
The computational graph was broken in two different places, due to two wrong operations. However, it was very difficult to debug and to find the source of the issue: I could not find a tool or library to visualize the graph, which is the backbone of gradient backpropagation.
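
For future readers, one low-tech check that can help locate such a break (a minimal sketch; entity_logits stands for any intermediate tensor of the forward pass) is to print grad_fn along the forward path. A tensor produced by a differentiable operation carries a grad_fn, while a tensor that was detached or re-wrapped (e.g. with torch.tensor(...)) has grad_fn == None, even if requires_grad is True:

def check_graph(name, t):
    # grad_fn is None for leaves and for tensors cut off from the graph
    print(f"{name}: requires_grad={t.requires_grad}, grad_fn={t.grad_fn}")

check_graph("entity_logits", entity_logits)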

Hi @arhouati, I have a similar issue.
How were you able to debug this?

Hi @arhouati,

Thanks for reporting your issue and progress on it here.
I believe I ran into the same issue while finetuning a BERT-based model.
My code seems to be similar to what you implemented, and I tried all other “parameters are not updating” suggestions (for some days now).
If you have time, it would help a lot if you could share which operations made your computational graph break. Does it break where you wrap the entity_logits into a tensor?

Thanks,
Madelon


Update: I just solved my problem. I had to convert my sample labels and another variable (the sample size) to PyTorch Variables (torch.autograd.Variable) to be included in the computational graph. Hope this is helpful for future readers.
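
A minimal sketch of that conversion (the variable names are illustrative; note that since PyTorch 0.4, Variable has been merged into Tensor, so this wrap returns a plain tensor):

from torch.autograd import Variable

# Wrap the labels and the sample-size tensor before building the loss.
entity_types = Variable(entity_types)
sample_size = Variable(sample_size)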