Getting "element 0 of tensors does not require grad" error while fine-tuning T5

This is the error I am getting:
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn.

I am trying to fine-tune the T5 model here.
The Model:

# imports (not shown in the original post)
import torch
import pytorch_lightning as pl
from torch.optim import AdamW
from transformers import T5ForConditionalGeneration, get_linear_schedule_with_warmup


class LModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = T5ForConditionalGeneration.from_pretrained("t5-base", return_dict=True)
  
    def forward(self, input_ids, attention_mask, labels=None, decoder_attention_mask=None):
        outputs = self.model(input_ids=input_ids,
                        attention_mask=attention_mask,
                        labels=labels,
                        decoder_attention_mask=decoder_attention_mask)
        return outputs.loss, outputs.logits
  
    def training_step(self, batch, batch_idx):
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["summary_ids"]
        decoder_attention_mask = batch["summary_mask"]

        loss, output = self(input_ids, attention_mask, labels, decoder_attention_mask)
        return loss

    def validation_step(self, batch, batch_idx):
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["summary_ids"]
        decoder_attention_mask = batch["summary_mask"]

        loss, output = self(input_ids, attention_mask, labels, decoder_attention_mask)
        return loss
 
    def test_step(self, batch, batch_idx):
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["summary_ids"]
        decoder_attention_mask = batch["summary_mask"]

        # labels are needed here as well, otherwise the model returns no loss
        loss, output = self(input_ids, attention_mask, labels, decoder_attention_mask)
        return loss
    
    def configure_optimizers(self):
        optimizer = AdamW(self.model.parameters(), lr=0.0001)
        # EPOCHS and df (the training data frame) are defined elsewhere in the script
        scheduler = get_linear_schedule_with_warmup(
                optimizer, num_warmup_steps=0,
                num_training_steps=EPOCHS * len(df))
        return {'optimizer': optimizer, 'lr_scheduler': scheduler}

And the trainer looks like this:

device = "cuda" if torch.cuda.is_available() else "cpu"

model = LModel()
trainer = pl.Trainer(
    max_epochs=EPOCHS,
    accelerator=device
)

trainer.fit(model, module)
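
(`module` is not shown in the post; for illustration, it is assumed to be a LightningDataModule roughly along these lines, yielding batches with the keys used in the steps above.)

# Hypothetical sketch of the data module (the real one is not shown in the post).
import pytorch_lightning as pl
from torch.utils.data import Dataset, DataLoader

class SummaryDataset(Dataset):
    def __init__(self, encodings):
        # encodings: dict of tensors with keys
        # "input_ids", "attention_mask", "summary_ids", "summary_mask"
        self.encodings = encodings

    def __len__(self):
        return len(self.encodings["input_ids"])

    def __getitem__(self, idx):
        return {key: tensor[idx] for key, tensor in self.encodings.items()}

class SummaryDataModule(pl.LightningDataModule):
    def __init__(self, train_encodings, val_encodings, batch_size=8):
        super().__init__()
        self.train_ds = SummaryDataset(train_encodings)
        self.val_ds = SummaryDataset(val_encodings)
        self.batch_size = batch_size

    def train_dataloader(self):
        return DataLoader(self.train_ds, batch_size=self.batch_size, shuffle=True)

    def val_dataloader(self):
        return DataLoader(self.val_ds, batch_size=self.batch_size)

# module = SummaryDataModule(train_encodings, val_encodings)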

Can anyone let me know why I am getting this "element 0" error and how to fix it?

This error is raised if you are explicitly detaching a tensor in the forward pass, using non-differentiable operations, or disabling gradient calculation globally.
I don't know how the forward method is implemented in your model, but you could print the .grad_fn attribute of each intermediate tensor and check whether a valid backward function is returned, to isolate where the detaching takes place.
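
As a minimal illustration of these failure modes (a generic sketch, not your T5 code): a tensor created under torch.no_grad() or after an explicit .detach() has grad_fn=None, and calling .backward() on it raises exactly this RuntimeError.

import torch

w = torch.randn(3, requires_grad=True)
x = torch.randn(3)

with torch.no_grad():              # gradient calculation disabled here
    loss = (w * x).sum()
print(loss.grad_fn)                # None
# loss.backward()                  # RuntimeError: element 0 of tensors does not require grad ...

loss = (w * x).sum().detach()      # an explicit detach has the same effect
print(loss.grad_fn)                # None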

Hi @ptrblck ,

I tried printing the grad_fn:

    def forward(self, input_ids, attention_mask, labels=None, decoder_attention_mask=None):
        outputs = self.model(input_ids=input_ids,
                        attention_mask=attention_mask,
                        labels=labels,
                        decoder_attention_mask=decoder_attention_mask)
        print("outputs.logits.grad_fn:", outputs.logits.grad_fn)
        print("total params requiring grad:", sum(p.numel() for p in self.model.parameters() if p.requires_grad))
        print("model.training:", self.model.training)
        return outputs.loss, outputs.logits

I got this output:

outputs.logits.grad_fn: None
total params requiring grad: 222903552
model.training: True

What does outputs.loss.grad_fn return? Maybe the .logits were explicitly detached while the .loss is still differentiable?
If .loss.grad_fn is also set to None, you would need to check the model's forward pass.

outputs.logits.grad_fn: None
outputs.loss.grad_fn: None

Yes, the loss's grad_fn is also shown as None.

It's strange, because I am not deliberately changing anything inside the model and am using the pretrained version directly.

Yes, it's indeed strange, and you would thus need to check the internals of this model and how it was implemented, as either the model or the script might disable gradient calculation globally or via with torch.no_grad() for some reason that I'm unaware of.
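
For example, if something switches gradient calculation off globally, every later forward pass produces tensors with grad_fn=None, which matches the output you posted (a small generic sketch, not your actual script):

import torch

torch.set_grad_enabled(False)      # e.g. left active after an inference block
w = torch.randn(2, requires_grad=True)
loss = (w ** 2).sum()
print(loss.grad_fn)                # None

torch.set_grad_enabled(True)       # restoring the global flag fixes it
loss = (w ** 2).sum()
print(loss.grad_fn)                # <SumBackward0 object at ...>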

Yes, I rewrote the code in plain PyTorch and it worked fine; maybe there is some issue with the PyTorch Lightning wrapper.
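
(A hypothetical sketch of such a plain-PyTorch training loop, for illustration; the actual rewrite is not shown in the thread.)

import torch
from torch.optim import AdamW
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-base")
model.train()
optimizer = AdamW(model.parameters(), lr=1e-4)

for batch in train_dataloader:     # any DataLoader yielding the keys used above
    outputs = model(input_ids=batch["input_ids"],
                    attention_mask=batch["attention_mask"],
                    labels=batch["summary_ids"],
                    decoder_attention_mask=batch["summary_mask"])
    print(outputs.loss.grad_fn)    # here this is a valid grad_fn, not None
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()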

To debug it further, you could use print(torch.is_grad_enabled()) to see if something disabled gradient calculation (by mistake).

Sure @ptrblck, I will try it. Thanks a lot; I will update the thread with the solution if I find one.