Variables are not updated after loss.backward() and optimizer.step()

Hello,

I’m trying to build a multi-label classifier based on BERT fine-tuning. However, the model parameters are not updated after loss.backward() and optimizer.step(), so my model never trains.

I already checked that all model parameters have requires_grad set to True.
I also followed the computation graph to make sure that all intermediate variables have requires_grad set to True as well.

For reference, I am using PyTorch 1.7.0.

Here are my model’s layers:

UNERLinearModel(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        ....
        (11): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )
  (size_embeddings): Embedding(300, 300)
  (lstm): BiLstmContextuelLayer(
    (layers): ModuleList(
      (0): BidirLSTMLayer(
        (directions): ModuleList(
          (0): LSTMLayer(
            (cell): CustomLSTMCell()
          )
          (1): ReverseLSTMLayer(
            (cell): CustomLSTMCell()
          )
        )
      )
      (1): BidirLSTMLayer(
        (directions): ModuleList(
          (0): LSTMLayer(
            (cell): CustomLSTMCell()
          )
          (1): ReverseLSTMLayer(
            (cell): CustomLSTMCell()
          )
        )
      )
      (2): BidirLSTMLayer(
        (directions): ModuleList(
          (0): LSTMLayer(
            (cell): CustomLSTMCell()
          )
          (1): ReverseLSTMLayer(
            (cell): CustomLSTMCell()
          )
        )
      )
    )
  )
  (entity_classifier): Linear(in_features=1, out_features=3, bias=True)
  (dropout): Dropout(p=0.1, inplace=False)
)

And here is my loss computation and training step:

loss_function = torch.nn.BCEWithLogitsLoss()

entity_logits = torch.tensor(entity_logits.float(), requires_grad=True)
entity_types = entity_types.view(-1).float()
entity_masks = entity_masks.view(-1).float()

a = self._model.lstm.layers[0].directions[0].cell.W.clone()

train_loss = loss_function(entity_logits, entity_types)
train_loss = (train_loss * entity_masks).sum() / entity_masks.sum()

# Debug only: print all of the model's parameters that should be updated
for name, param in self._model.named_parameters():
    if param.requires_grad:
        print(name, param)

# Backward pass: compute gradients for the parameters of all model layers
train_loss.backward(retain_graph=True)
torch.nn.utils.clip_grad_norm_(self._model.parameters(), self._max_grad_norm)

self._optimizer.step()
self._scheduler.step()
self._model.zero_grad()

b = self._model.lstm.layers[0].directions[0].cell.W.clone()
print(f"parameters update {not(torch.equal(a, b))}")

return train_loss.item()

How can I properly debug this issue?
How can I make sure the graph is not broken somewhere?
How can I make sure the backward pass is executed correctly?

Please help me with any suggestion; I have been stuck on this topic for 4 days.

Thank you very much.

Hi, can you try it without the clipping? Maybe it is clipped too tightly, and then you don’t lose your computational graph, but you do lose your mathematical gradient.
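
For reference, torch.nn.utils.clip_grad_norm_ returns the total gradient norm before clipping, so a quick print (a minimal sketch reusing the variables from your snippet) would show how tight the limit actually is:

# clip_grad_norm_ returns the total norm of the gradients *before* clipping,
# so printing it shows whether the limit is far below the actual norm.
total_norm = torch.nn.utils.clip_grad_norm_(self._model.parameters(), self._max_grad_norm)
print(f"grad norm before clipping: {float(total_norm):.4f} (limit: {self._max_grad_norm})")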

Unfortunately, I already tried that. I removed the line:

torch.nn.utils.clip_grad_norm_(self._model.parameters(), self._max_grad_norm)

But the result is always the same: no parameter is updated.

Does your self._optimizer try to update the correct parameters?


How can I check this?

However, the optimizer does contain params with requires_grad set to True. I checked it with the following code:

for param in self._optimizer.param_groups[0]['params']:
    if param.requires_grad:
        print(param.grad)

Note that the .grad of every param is None.

For information, I am using AdamW as the optimizer.
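
For reference, here is one way to confirm that the optimizer holds the very same parameter objects as the model (a minimal sketch using the objects above):

# The optimizer must reference the model's actual Parameter objects,
# not copies or detached tensors, so compare object identities.
model_param_ids = {id(p) for p in self._model.parameters()}
optim_param_ids = {id(p) for group in self._optimizer.param_groups
                   for p in group['params']}
print(optim_param_ids <= model_param_ids)  # should print True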

Actually, that looks fine.
Maybe your update works. Can you print a before your optimizer step line, print b after it, and check whether they are the same?

The two variables a and b are always equal.
Any other suggestions for debugging?

@CedricLy Thanks for your help.

Check the .grad attribute of all parameters that don’t seem to be updated, and see whether the gradients have valid values, are all zero, or are even None.
If they are set to None after the first backward and before the first zero_grad() call, your computation graph might be broken. On the other hand, if they are zero, some operations in the model might kill the gradient.
Make sure that you check the .grad attributes after the backward and before the zero_grad operation.
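
Something like this (a minimal sketch) run at that point would show which case applies:

# Run after loss.backward() and before zero_grad():
for name, param in self._model.named_parameters():
    if param.grad is None:
        print(f"{name}: grad is None -> graph likely broken upstream")
    elif param.grad.abs().sum() == 0:
        print(f"{name}: grad is all zero -> some op may kill the gradient")
    else:
        print(f"{name}: grad looks valid (norm {float(param.grad.norm()):.4e})")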


Finally, after 5 days, I found the error.
The computational graph was broken in two different places, due to two wrong operations. However, it was very difficult to debug and to find the source of the issue: I could not find a tool or library to visualize the graph, which is the backbone of gradient backpropagation.
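
For future readers, one low-tech check that can help locate such a break (a minimal sketch; entity_logits stands for any intermediate tensor of the forward pass) is to print grad_fn along the forward path. A tensor produced by a differentiable operation carries a grad_fn, while a tensor that was detached or re-wrapped (e.g. with torch.tensor(...)) has grad_fn == None, even if requires_grad is True:

def check_graph(name, t):
    # grad_fn is None for leaves and for tensors cut off from the graph
    print(f"{name}: requires_grad={t.requires_grad}, grad_fn={t.grad_fn}")

check_graph("entity_logits", entity_logits)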

Hi @arhouati, I have a similar issue.
How were you able to debug this?

Hi @arhouati,

Thanks for reporting your issue and progress on it here.
I believe I ran into the same issue while finetuning a BERT-based model.
My code seems to be similar to what you implemented, and I tried all other “parameters are not updating” suggestions (for some days now).
If you have time, it would help a lot if you could share which operations made your computational graph break. Does it break where you wrap the entity_logits into a tensor?

Thanks,
Madelon


Update: I just solved my problem. I had to convert my sample labels and another variable (the sample size) to PyTorch Variables (torch.autograd.Variable) to be included in the computational graph. Hope this is helpful for future readers.
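
A minimal sketch of that conversion (the variable names are illustrative; note that since PyTorch 0.4, Variable has been merged into Tensor, so this wrap returns a plain tensor):

from torch.autograd import Variable

# Wrap the labels and the sample-size tensor before building the loss.
entity_types = Variable(entity_types)
sample_size = Variable(sample_size)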