Gradients not flowing through the entire network, but the loss is decreasing

Hi. I’m not sure whether this is the appropriate place to post this, but I figured I could use any help I can get.

I’m currently trying to train a neural network for binary classification, but it isn’t training properly. I noticed this because it only predicts 1 at test time, so as a sanity check I trained it on data whose labels are all 0. Even then, it still predicted mainly 1’s (95% of the time, in fact).
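For context, the test-time check I run looks roughly like this (a sketch; test_loader and the sigmoid-plus-0.5 threshold are simplifications of my actual evaluation code):

model.eval()
preds = []
with torch.no_grad():
    for a, b, label in test_loader:
        if torch.cuda.is_available():
            a, b = a.to('cuda'), b.to('cuda')
        logits = model(a, b)
        # threshold the sigmoid output at 0.5 to get hard 0/1 predictions
        preds.append((torch.sigmoid(logits) > 0.5).float().cpu())
preds = torch.cat(preds)
print(f"fraction predicted as 1: {preds.mean().item():.2f}")  # ~0.95 for me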

After seeing this, I plotted the average gradients per layer for each epoch, and the plot looks something like this:

[Plot: average gradients per layer for each epoch]
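The per-layer averages are collected right after the backward pass, along these lines (a sketch; the plotting code itself is omitted):

# after loss.backward(): record the mean absolute gradient of each layer
grad_means = {name: param.grad.abs().mean().item()
              for name, param in model.named_parameters()
              if param.grad is not None}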

The overall code framework is roughly as follows:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Model(nn.Module):
    def __init__(self, num, dim):  # dim is unused in this snippet
        super().__init__()  # required so nn.Module registers the submodules
        self.features_linear = nn.Linear(in_features=300, out_features=100)
        self.embedding = nn.Embedding(num_embeddings=num, embedding_dim=10)
        # the concatenated representation is 100 (features) + 10 (embedding dim) wide
        self.output_linear = nn.Linear(in_features=(100 + 10), out_features=1)
        self.dropout = nn.Dropout(p=0.8)

    def forward(self, a, b):
        a2 = F.relu(self.features_linear(a))  # dense features -> (batch, 100)
        b_emb = self.embedding(b)             # indices -> (batch, seq_len, 10)
        emb_sum = torch.sum(b_emb, dim=1)     # sum over the sequence -> (batch, 10)

        input_rep = torch.cat((a2, emb_sum), dim=1)  # (batch, 110)
        input_rep = self.dropout(input_rep)
        output = self.output_linear(input_rep)  # raw score, no sigmoid applied here

        return output


model = Model(16, 100)

if torch.cuda.is_available():
    model = model.to('cuda')

criterion = nn.BCELoss()
optimizer = torch.optim.Adam(params=model.parameters())

# num_epochs and training_loader are defined elsewhere
for epoch in range(num_epochs):
    epoch_loss = 0.0  # reset the running loss at the start of each epoch
    for batch in training_loader:
        optimizer.zero_grad()

        a, b, label = batch

        if torch.cuda.is_available():
            a = a.to('cuda')
            b = b.to('cuda')
            label = label.to('cuda')

        output = model(a, b)

        loss = criterion(output, label)
        epoch_loss += loss.item()
        loss.backward()
        optimizer.step()
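    # end of each epoch: report the running loss averaged over batches
    # (this is where I see the loss decreasing)
    print(f"epoch {epoch}: mean loss = {epoch_loss / len(training_loader):.4f}")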

What’s especially perplexing is that the loss decreases quite nicely over training (I don’t have a plot for that at the moment).

Does anyone have an idea of what I might be doing wrong? Any tips or thoughts are appreciated. Thanks.

Hello,
are your inputs (images?) rescaled between 0 and 1?
I see that your model’s output size is 1. Do you mean to use a sigmoid activation? If so, note that it is not included in BCELoss.
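For example, a sketch of the two standard options (label must be a float tensor with the same shape as the output):

# Option 1: keep BCELoss and apply the sigmoid yourself
criterion = nn.BCELoss()
loss = criterion(torch.sigmoid(output), label)

# Option 2 (usually preferred, more numerically stable): pass the raw
# logits to BCEWithLogitsLoss, which applies the sigmoid internally
criterion = nn.BCEWithLogitsLoss()
loss = criterion(output, label)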

If that is the problem, some parts of this discussion may help:

Hi @pfloat, sorry for the late reply. I am using the sigmoid activation function right now. Perhaps I could change it to ReLU and see how that works.