Problem with the training loss of FCN for segmentation

Dear programmers,

I am very new to PyTorch and have little programming experience. I have built a network, and my training loop is as follows:

Epoch_num = 5

for e in range(Epoch_num):
    train_loss = 0
    model.train()

    for idx, data in tqdm(enumerate(train_loader)):
        x, y_true = data
        if torch.cuda.is_available():
            x, y_true = x.cuda(), y_true.cuda()

        # forward
        out = model(x)
        out = F.log_softmax(out, dim=1)  # (b, n, h, w)
        loss = criterion(out, y_true)

        # backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        train_loss += loss.item()

    # accuracy is computed on the last batch of the epoch
    label_pred = out.max(dim=1)[1].data.cpu()
    label_true = y_true.unsqueeze(1).data.cpu()
    acc = get_accuracy(label_true, label_pred)
    print("Epoch {}/{}, Loss: {:.3f}, Accuracy: {:.3f}".format(e + 1, Epoch_num, train_loss, acc))

The training output is as follows:

Epoch 1/5, Loss: 765.190, Accuracy: 0.513

0it [00:00, ?it/s]
1it [00:00, 6.00it/s]
2it [00:00, 6.06it/s]
3it [00:00, 6.12it/s]
4it [00:00, 6.11it/s]
...
1101it [03:00, 6.08it/s]
1102it [03:00, 6.12it/s]
1103it [03:00, 6.20it/s]
1104it [03:00, 6.22it/s]
Epoch 2/5, Loss: 765.112, Accuracy: 0.514

...
1103it [02:54, 6.36it/s]
1104it [02:55, 6.29it/s]
Epoch 3/5, Loss: 764.840, Accuracy: 0.535

...
1104it [03:00, 6.10it/s]
Epoch 4/5, Loss: 761.322, Accuracy: 0.704

As can be seen, the loss does decrease while the accuracy increases. However, the loss is still very high. Could you please help me check what is wrong with my implementation? Also, how can I modify the code so that not every mini-batch is displayed during training?

Thank you very much for your time and guidance.

Based on the usage of F.log_softmax I assume you are using nn.NLLLoss as your criterion?
If so, did you pass any argument for reduction, or are you using the default settings?
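
For reference, this is the shape convention nn.NLLLoss expects in the segmentation case (a minimal sketch with random data, assuming e.g. a two-class setup):

import torch
import torch.nn as nn
import torch.nn.functional as F

criterion = nn.NLLLoss()  # default settings, i.e. reduction='mean'

logits = torch.randn(4, 2, 8, 8)          # model output: (batch, classes, h, w)
log_probs = F.log_softmax(logits, dim=1)  # log-probabilities over the class dim
target = torch.randint(0, 2, (4, 8, 8))   # target: (batch, h, w), long class indices

loss = criterion(log_probs, target)
print(loss.shape)  # torch.Size([]) -> a single scalar with reduction='mean'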

To remove the printing you might just want to remove the tqdm call. :wink:

I am very grateful for your reply, sir.
I was using the default setting. After your suggestion, I went and read the documentation; however, I still cannot figure it out.

When I set reduction='sum', the loss is NaN.
When I set reduction='none', I get the following error: RuntimeError: grad can be implicitly created only for scalar outputs.
I have searched online but could not find any suitable solution.

As for the printing, removing the tqdm call solved the problem. However, is there a way to print the loss only after a certain number of batches?

Thank you for your time and guidance.

How many classes are you using?
The loss value seems quite high for a reasonable accuracy, but it might of course still be valid.
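
As a sanity check, you could divide the summed loss by the number of batches and compare it to the loss of a model that just predicts uniform probabilities; e.g. for two classes that would be ln(2) ≈ 0.693 per pixel:

import math

print(math.log(2))     # 0.693147..., NLL of a uniform two-class prediction
print(765.190 / 1104)  # 0.693107..., your summed loss over your 1104 batches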

If you don’t reduce the loss, you would have to pass the gradient into backward with the same shape as your output.
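
A minimal sketch with random data of what I mean (with reduction='none' the loss keeps the spatial shape):

import torch
import torch.nn as nn
import torch.nn.functional as F

criterion = nn.NLLLoss(reduction='none')

logits = torch.randn(4, 2, 8, 8, requires_grad=True)
log_probs = F.log_softmax(logits, dim=1)
target = torch.randint(0, 2, (4, 8, 8))

loss = criterion(log_probs, target)            # shape (4, 8, 8): one value per pixel
loss.backward(gradient=torch.ones_like(loss))  # equivalent to loss.sum().backward()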

The usual approach to printing statistics is to use a condition inside your training loop, as seen e.g. here.
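
The pattern is just this (a toy stand-in loop so the snippet runs on its own):

print_freq = 100  # print every print_freq batches; pick whatever suits you

train_loss = 0.0
for idx, loss_value in enumerate([0.7] * 350):  # stand-in for the real loader and loss
    train_loss += loss_value
    if idx % print_freq == 0 and idx != 0:
        print('batch {}: running average loss {:.6f}'.format(idx, train_loss / idx))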

My task is binary segmentation (foreground and background).

"If you don't reduce the loss, you would have to pass the gradient into backward with the same shape as your output." Please, could you help me with an example?

I have read the post you recommended and have modified my code as follows:

Epoch_num = 5

for e in range(Epoch_num):
    print('Epoch', e + 1)
    train_loss = 0
    model.train()

    for idx, data in enumerate(train_loader):
        x, y_true = data
        if torch.cuda.is_available():
            x, y_true = x.cuda(), y_true.cuda()

        # forward
        out = model(x)
        out = F.log_softmax(out, dim=1)  # (b, n, h, w)
        loss = criterion(out, y_true)

        # backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        train_loss += loss.item()

        # print the running average loss every 100 batches
        if idx % 100 == 0 and idx != 0:
            print('[{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                idx, len(train_loader.dataset),
                100. * idx / len(train_loader), train_loss / idx))

    # accuracy is computed on the last batch of the epoch
    label_pred = out.max(dim=1)[1].data.cpu()
    label_true = y_true.unsqueeze(1).data.cpu()
    acc = get_accuracy(label_true, label_pred)
    print("Epoch {}/{}, Loss: {:.5f}, Accuracy: {:.5f}".format(
        e + 1, Epoch_num, train_loss / len(train_loader.dataset), acc))

As can be seen, I have divided the total training loss by the number of samples in the training data. Please check whether doing so is logical.

Some statistics from the training process are given below:

Epoch 1
[100/1104 (9%)] Loss: 0.700105
[200/1104 (18%)] Loss: 0.696637
[300/1104 (27%)] Loss: 0.695482
[400/1104 (36%)] Loss: 0.694904
[500/1104 (45%)] Loss: 0.694557
[600/1104 (54%)] Loss: 0.694326
[700/1104 (63%)] Loss: 0.694161
[800/1104 (72%)] Loss: 0.694037
[900/1104 (82%)] Loss: 0.693941
[1000/1104 (91%)] Loss: 0.693863
[1100/1104 (100%)] Loss: 0.693800
Epoch 1/5, Loss: 0.69317, Accuracy: 0.48804

Epoch 2
[100/1104 (9%)] Loss: 0.700100
[200/1104 (18%)] Loss: 0.696634
[300/1104 (27%)] Loss: 0.695478
[400/1104 (36%)] Loss: 0.694899
[500/1104 (45%)] Loss: 0.694552
[600/1104 (54%)] Loss: 0.694321
[700/1104 (63%)] Loss: 0.694156
[800/1104 (72%)] Loss: 0.694032
[900/1104 (82%)] Loss: 0.693935
[1000/1104 (91%)] Loss: 0.693858
[1100/1104 (100%)] Loss: 0.693795
Epoch 2/5, Loss: 0.69317, Accuracy: 0.48364

Epoch 3
[100/1104 (9%)] Loss: 0.700095
[200/1104 (18%)] Loss: 0.696631
[300/1104 (27%)] Loss: 0.695475
[400/1104 (36%)] Loss: 0.694896
[500/1104 (45%)] Loss: 0.694549
[600/1104 (54%)] Loss: 0.694317
[700/1104 (63%)] Loss: 0.694152
[800/1104 (72%)] Loss: 0.694028
[900/1104 (82%)] Loss: 0.693932
[1000/1104 (91%)] Loss: 0.693855
[1100/1104 (100%)] Loss: 0.693791
Epoch 3/5, Loss: 0.69316, Accuracy: 0.49097

Epoch 4
[100/1104 (9%)] Loss: 0.700089
[200/1104 (18%)] Loss: 0.696625
[300/1104 (27%)] Loss: 0.695470
[400/1104 (36%)] Loss: 0.694892
[500/1104 (45%)] Loss: 0.694545
[600/1104 (54%)] Loss: 0.694314
[700/1104 (63%)] Loss: 0.694149
[800/1104 (72%)] Loss: 0.694025
[900/1104 (82%)] Loss: 0.693928
[1000/1104 (91%)] Loss: 0.693851
[1100/1104 (100%)] Loss: 0.693788
Epoch 4/5, Loss: 0.69316, Accuracy: 0.50049