Issue with 'loss.backward()' When Optimizing Model Parameters

import gc
import torch
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt

torch.set_printoptions(sci_mode=False)
 
gradient_norms = []
losses = []

for epoch in tqdm(range(epochs)):
    # Checkpoint the model at the start of each epoch
    model_path = f"/kaggle/working/Training/ace_state_dict_{epoch+1}.pth"
    torch.save(model.state_dict(), model_path)
    model.train()
    total_loss = 0

    for batch_idx, batch in enumerate(tqdm(train_dataloader, desc=f'Epoch {epoch + 1}/{epochs}')):
        optimizer.zero_grad()
    
        logits = model(batch["inputs"])
        targets = batch["targets"]

        # Scale the loss down because the raw values are very large
        loss = loss_fn(logits.view(-1, logits.size(-1)), targets.float()) / 1000000000
        loss.backward()

        # Clip gradients and record the total norm (returned before clipping)
        grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        gradient_norms.append(grad_norm.item())

        optimizer.step()
        scheduler.step()
        total_loss += loss.item()

        if batch_idx % 100 == 0:
            print(f'Batch {batch_idx}/{len(train_dataloader)}, Loss: {total_loss/(batch_idx+1)}, Gradient Norms: {grad_norm}')
 
    avg_loss = total_loss / len(train_dataloader)
    losses.append(avg_loss)

    print(f"Train loss: {avg_loss}")

In my code above, the “loss.backward()” call is not working as expected: “grad_norm” consistently registers as “0.0”, which prevents the model’s parameters from being updated at each iteration. Any guidance or suggestions to rectify this would be greatly appreciated. For reference, you can find the full source code of my model at:

TransformerChatbot | Kaggle

I’ve experimented with various combinations of hyperparameters and learning rates, but unfortunately the issue persists. I also attempted to address it with “torch.nn.utils.clip_grad_norm_()”, but to no avail. All the relevant tensors and the model’s parameters have “requires_grad=True”. Any insights or alternative approaches to resolve this would be greatly appreciated.
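
For what it’s worth, this is the quick check I run right after “loss.backward()” to confirm the gradients really are near zero (a minimal sketch against the model above):

import torch

# Inspect every parameter's gradient after loss.backward()
for name, param in model.named_parameters():
    if param.grad is None:
        print(f"{name}: no gradient (detached from the graph?)")
    else:
        print(f"{name}: grad norm = {param.grad.norm().item():.3e}")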

Hi @CuteDeadu,

Do you need to scale your loss values by a factor of 1000000000? Surely this would push any gradient values to near 0? Also, could you try casting your targets as type Tensor rather than type float?
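
To illustrate: scaling a loss by a constant scales every gradient by the same constant, so dividing by 1e9 leaves gradient norms around 1e-9, which print as 0.0 under your sci_mode=False setting. A minimal standalone sketch (hypothetical shapes, not your model):

import torch
import torch.nn.functional as F

x = torch.randn(4, 10)
w = torch.randn(10, 3, requires_grad=True)
target = torch.randint(0, 3, (4,))

# Unscaled loss: gradients have a "normal" magnitude
F.cross_entropy(x @ w, target).backward()
print(w.grad.norm())        # e.g. on the order of 1

# Same loss divided by 1e9: every gradient shrinks by the same factor
w.grad = None
(F.cross_entropy(x @ w, target) / 1e9).backward()
print(w.grad.norm())        # ~1e-9, which prints as 0.0 with sci_mode=False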


I’ve attempted training without dividing the loss by 10^9, but unfortunately encountered the same issue. Regarding your query about casting targets as type Tensor rather than type float, I’ve experimented with both approaches, yet the problem persists. The reason for initially dividing by 10^9 is that the loss values tend to be quite large.

What is the loss_fn here?

It’s the nn.CrossEntropyLoss() function, and the optimizer is Adam().

For the nn.CrossEntropyLoss() function, doesn’t the input need to be the normalized probabilities from the Softmax function rather than the logits, as shown in the docs: CrossEntropyLoss — PyTorch 2.3 documentation?

That might explain why your initial loss is too high, and this would also allow you to remove the 1e9 factor.

Thank you so much for your suggestion! I adjusted the input for the nn.CrossEntropyLoss() function to be the normalized probability from the Softmax function instead of the logits, as per the documentation you provided. This solution indeed resolved the issue, and I was able to remove the 1e9 factor. I sincerely appreciate your time and assistance in helping me address this problem.

No, the inputs (i.e. the model outputs) are supposed to be logits, since internally F.nll_loss(F.log_softmax()) will be used.
From the docs:

The input is expected to contain the unnormalized logits for each class (which do not need to be positive or sum to 1, in general).

The targets are supposed to be either class indices in the range [0, nb_classes-1] or “soft” targets given as probabilities.
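
A minimal sketch of the intended usage (hypothetical shapes; raw logits in, class indices as targets):

import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()

logits = torch.randn(8, 5, requires_grad=True)  # raw model outputs, no Softmax applied
targets = torch.randint(0, 5, (8,))             # class indices in [0, nb_classes-1] as a LongTensor

loss = loss_fn(logits, targets)
loss.backward()
print(loss.item(), logits.grad.norm().item())   # non-zero gradients flow as expected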
