Why does the loss function always return zero after the first epoch?

Why is the loss function always printing zero after the first epoch?

I suspect it’s because of loss = loss_fn(outputs, torch.max(labels, 1)[1]).

And if I instead use loss = loss_fn(outputs, torch.max(labels, 1)[0]), I get values that seem too high and I'm not sure they make sense, like 1200, 800, 600, 500 (one value per epoch).
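
To see what the two indexing choices actually return, here is a quick sketch on a made-up [4, 1] label tensor shaped like mine:

import torch

# toy class-index labels with the same column shape as my real batch
labels = torch.tensor([[7], [1], [2], [9]])

print(torch.max(labels, 1)[0])  # tensor([7, 1, 2, 9]) -> the label values themselves
print(torch.max(labels, 1)[1])  # tensor([0, 0, 0, 0]) -> the argmax along dim 1, always 0 here

My training code is: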

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

# modell and train_loader are defined elsewhere
nepochs = 5
losses = np.zeros(nepochs)

loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(modell.parameters(), lr=0.001)

for epoch in range(nepochs):
    running_loss = 0.0
    n = 0

    for data in train_loader:
        # train on a single batch per epoch
        if n == 1:
            break

        inputs, labels = data

        optimizer.zero_grad()
        outputs = modell(inputs)

        #loss = loss_fn(outputs, labels)
        loss = loss_fn(outputs, torch.max(labels, 1)[1])
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        n += 1

    losses[epoch] = running_loss / n
    print(f"epoch: {epoch+1} loss: {losses[epoch]:.3f}")

The model is:

class Classifier(nn.Module):
    def __init__(self, labels=10):
        super(Classifier, self).__init__()
        self.fc = nn.Linear(3 * 64 * 64, labels)

    def forward(self, x):
        out = x.reshape(x.size(0), -1)  # flatten [N, 3, 64, 64] -> [N, 3*64*64]
        out = self.fc(out)
        return out

The labels variable is a tensor of shape [64, 1], like this:
tensor([[7], [1], [2], [3], [2], [9], [9], [8], [9], [8], [1], [7], [9], [2], [5], [1], [3], [3], [8], [3], [7], [1], [7], [9], [8], [8], [3], [7], [5], [1], [7], [3], [2], [1], [3], [3], [2], [0], [3], [4], [0], [7], [1], [8], [4], [1], [5], [3], [4], [3], [4], [8], [4], [1], [9], [7], [3], [2], [6], [4], [8], [3], [7], [3]])

Your labels tensor already contains class indices but has an additional, unnecessary dimension.
The right approach is to call labels = labels.squeeze(1) and pass the result to the criterion.
Using torch.max(labels, dim=1)[0] would yield the same output.
However, torch.max(labels, dim=1)[1] returns the indices of the max values along dim1, which here is a tensor full of zeros. Those targets are wrong and explain the zero loss: your model simply learns to always predict class 0.
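
A minimal sketch of the fix, assuming 10 classes and random logits in place of a real model:

import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()

outputs = torch.randn(64, 10)           # model logits, [N, num_classes]
labels = torch.randint(0, 10, (64, 1))  # class indices with the extra dim, [N, 1]

targets = labels.squeeze(1)             # -> [N], what CrossEntropyLoss expects
loss = loss_fn(outputs, targets)
print(loss.item())                      # nonzero, roughly -log(1/10) ≈ 2.3 for random logits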


Hi @ptrblck, I'm facing the same (or a similar) problem and was wondering if you could help me out. Admittedly, I'm also fairly new to all of this. I'm trying to do semantic segmentation and my masks are not one-hot encoded.

My images have the shape torch.Size([3, 320, 320]) and my masks the shape torch.Size([1, 320, 320]).
This is my code:

learning_rate = 0.0005
batch_size = 32
#image_size = 224
num_epochs = 10

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('The current processor is ...', device)

# specifying loss function
criterion = nn.CrossEntropyLoss()

# specifying the network
model = smp.Unet('resnet34', encoder_weights='imagenet', classes=1)

# specifying optimizer
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

if torch.cuda.is_available():
    model.cuda()

n_total_steps = len(train_loader)

# training loop
for epoch in range(num_epochs):
    for i, (images, masks) in enumerate(tqdm(train_loader, desc=f"Epoch {epoch+1}/{num_epochs}")):

        images = images.to(device)
        masks = masks.to(device)

        # forward pass
        outputs = model(images)
        loss = criterion(outputs, masks)

        # backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        print(loss.item())

print('Finished Training')

I'm also trying to get the Jaccard index to work (MulticlassJaccardIndex), but I always get the following error: RuntimeError: bincount only supports 1-d non-negative integral inputs.

Thanks in advance for the help!

I'm unsure where and how exactly MulticlassJaccardIndex is called, as your code doesn't show it. However, I guess your targets might not contain class indices as an integer type (int32 or long == int64). If so, either transform the target directly, or call torch.argmax if your target contains probabilities.
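
A minimal sketch of that conversion, assuming masks of shape [N, 1, 320, 320] that store class indices, random logits in place of the model output, and 5 classes:

import torch
from torchmetrics.classification import MulticlassJaccardIndex

num_classes = 5
outputs = torch.randn(2, num_classes, 320, 320)          # model logits, [N, C, H, W]
masks = torch.randint(0, num_classes, (2, 1, 320, 320))  # class-index masks, [N, 1, H, W]

# squeeze the channel dim and make sure the dtype is integral
targets = masks.squeeze(1).long()  # -> [N, H, W]
# if the masks held per-class probabilities instead, use:
# targets = masks.argmax(dim=1)

metric = MulticlassJaccardIndex(num_classes=num_classes)
print(metric(outputs, targets))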

@ptrblck thanks for replying! I may not have expressed myself correctly: I'm also facing the problem of my loss always being 0. The code above was my training loop without any kind of Jaccard index implementation.

The following code snippet was my attempt at including MulticlassJaccardIndex, but here I always get the error RuntimeError: bincount only supports 1-d non-negative integral inputs.

n_total_steps = len(train_loader)

# training loop
for epoch in range(num_epochs):
    for i, (images, masks) in enumerate(tqdm(train_loader, desc=f"Epoch {epoch+1}/{num_epochs}")):

        images = images.to(device)
        masks = masks.to(device)

        # forward pass
        outputs = model(images)
        loss = criterion(outputs, masks)

        # IoU score
        mcj_index = MulticlassJaccardIndex(num_classes=5)
        jaccard = mcj_index(outputs, masks)

        # backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        print(f"Epoch [{epoch+1}/{num_epochs}], Step [{i+1}/{len(train_loader)}], Jaccard Index: {jaccard.item():.4f}")

print('Finished Training')

Thanks in advance for the help!