When I train my network, the same numbers appear periodically in my training accuracy, and I don't know where the error comes from.

This is my train.py. I use cross-entropy loss for my network. outputs has size [4,2,224,224], where 4 is the batch size, 2 is the number of channels, and 224 is both h and w. output_c1 has size [4,224,224], and labels has size [4,224,224] too.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
from model import U_net
import visdom
from dataset import driveDateset
from torch import optim
from Dice_loss import DiceLoss
from Dice_loss import MulticlassDiceLoss
import matplotlib.pylab as plt
import numpy as np
import time

if __name__ == '__main__':
    DATA_DIRECTORY = "F:\\experiment_code\\U-net\\DRIVE\\training"
    DATA_LIST_PATH = "F:\\experiment_code\\U-net\\DRIVE\\training\\images_id.txt"
    Batch_size = 4
    epochs = 100
    dst = driveDateset(DATA_DIRECTORY, DATA_LIST_PATH)

    # Initialize model
    device = torch.device("cuda")
    model = U_net()
    model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    criteon = nn.CrossEntropyLoss() #reduce=False
    best_acc, best_epoch =0, 0
    global_step = 0
    start_time = time.time()
    viz = visdom.Visdom()
    for epoch in range(epochs):
        running_corrects = 0
        since_epoch = time.time()
        trainloader = torch.utils.data.DataLoader(dst, batch_size=Batch_size) #,shuffle =True
        for step, data in enumerate(trainloader):
            imgs, labels, _, _ = data
            imgs, labels = imgs.to(device), labels.to(device)
            labels = labels.long()
            model.train()
            outputs = model(imgs)   # output  B * C * H *W
            output_c1 = outputs[:,0,:,:] # C are 2 channels ,I choose the second channel
            Rounding_output_c1 = torch.round(output_c1)
            Rounding_output_c11 = torch.stack([(Rounding_output_c1 == i).float() for i in range(256)]) # [4,224,224] -> [256,4,224,224]; 256 is the number of classes, meaning pixel values from 0-255
            Rounding_output_c11 = Rounding_output_c11.permute(1,0,2,3) #[256,4,224,224]->[4,256,224,224]
            loss = criteon(Rounding_output_c11,labels)
            loss.requires_grad = True
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            viz.line([loss.item()],[global_step], win='loss', update='append',opts=dict(title='train_loss'))
            labels_float = labels.float()
            running_corrects = torch.sum(Rounding_output_c1 == labels_float).float()
            labels_size = labels.size(1) * labels.size(2) * 4
            training_acc = running_corrects / labels_size
            time_elapsed_epoch = time.time() - since_epoch
            print('epoch :', epoch, '\t', 'loss:', loss.item(),'\t','training_acc',training_acc,'\t','{:.0f}m {:.0f}s'.format(time_elapsed_epoch // 60, time_elapsed_epoch % 60))
            global_step += 1

I have checked that the incoming data has no problem; every training sample is traversed. The training data and labels look like this:


but when I train my network, the output picture looks like this:
and some results look like this:

epoch : 0 	 loss: 5.2415571212768555 	 training_acc tensor(0.3103, device='cuda:0') 	 0m 2s
epoch : 0 	 loss: 5.228370666503906 	 training_acc tensor(0.3235, device='cuda:0') 	 0m 2s
epoch : 0 	 loss: 5.224219799041748 	 training_acc tensor(0.3276, device='cuda:0') 	 0m 2s
epoch : 0 	 loss: 5.222436428070068 	 training_acc tensor(0.3294, device='cuda:0') 	 0m 2s
epoch : 0 	 loss: 5.2202887535095215 	 training_acc tensor(0.3316, device='cuda:0') 	 0m 2s
epoch : 1 	 loss: 5.2415571212768555 	 training_acc tensor(0.3103, device='cuda:0') 	 0m 0s
epoch : 1 	 loss: 5.22836971282959 	 training_acc tensor(0.3235, device='cuda:0') 	 0m 0s
epoch : 1 	 loss: 5.224219799041748 	 training_acc tensor(0.3276, device='cuda:0') 	 0m 0s
epoch : 1 	 loss: 5.222436428070068 	 training_acc tensor(0.3294, device='cuda:0') 	 0m 1s
epoch : 1 	 loss: 5.2202887535095215 	 training_acc tensor(0.3316, device='cuda:0') 	 0m 1s
epoch : 2 	 loss: 5.2415571212768555 	 training_acc tensor(0.3103, device='cuda:0') 	 0m 0s
epoch : 2 	 loss: 5.22836971282959 	 training_acc tensor(0.3235, device='cuda:0') 	 0m 0s
epoch : 2 	 loss: 5.224219799041748 	 training_acc tensor(0.3276, device='cuda:0') 	 0m 0s
epoch : 2 	 loss: 5.222436428070068 	 training_acc tensor(0.3294, device='cuda:0') 	 0m 1s
epoch : 2 	 loss: 5.2202887535095215 	 training_acc tensor(0.3316, device='cuda:0') 	 0m 1s
epoch : 3 	 loss: 5.2415571212768555 	 training_acc tensor(0.3103, device='cuda:0') 	 0m 0s
epoch : 3 	 loss: 5.22836971282959 	 training_acc tensor(0.3235, device='cuda:0') 	 0m 0s
epoch : 3 	 loss: 5.224219799041748 	 training_acc tensor(0.3276, device='cuda:0') 	 0m 0s
epoch : 3 	 loss: 5.222436428070068 	 training_acc tensor(0.3294, device='cuda:0') 	 0m 1s
epoch : 3 	 loss: 5.2202887535095215 	 training_acc tensor(0.3316, device='cuda:0') 	 0m 1s

I don’t know how to solve this problem.

Hi fyy!

I believe that your network isn’t actually training.

Note that because you do not shuffle your dataset, you will run
on the exact same batches in each epoch.
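
For what it's worth, shuffling is a one-argument change to
your DataLoader (a sketch, using the names from your script):

trainloader = torch.utils.data.DataLoader(dst, batch_size=Batch_size, shuffle=True)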

Although technically differentiable, the derivative of torch.round()
is zero (almost) everywhere. So the gradients that flow back to
your model parameters will be zero, and optimizer.step()
won’t actually do anything.
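
You can see this with a tiny example (a sketch, independent
of your model):

x = torch.tensor([0.3, 1.7], requires_grad=True)
y = torch.round(x).sum()
y.backward()
print(x.grad)
>  tensor([0., 0.])

The graph is intact, but the gradient that comes back is zero.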

The line

loss.requires_grad = True

is a red flag. Did you just put this in for good luck? Or did
you observe that loss.requires_grad == False? If the latter,
something upstream of loss in your processing is breaking /
detaching your computation graph (and would prevent gradients,
zero or not, from flowing back to your model parameters).
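
A quick way to check is to print the autograd attributes of
loss right after you compute it (a sketch, using the names
from your script):

loss = criteon(Rounding_output_c11, labels)
print(loss.requires_grad, loss.grad_fn)

An intact graph prints True and some *Backward grad_fn; a
detached graph prints False None.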

I see that the loss and accuracy repeat themselves exactly*
from epoch to epoch. So it appears that your model is not training.

*) The second loss in epoch 0 is slightly different than the second
loss in subsequent epochs. I’ll choose to attribute that to a slightly
different order of operations in the gpu and therefore differing
round-off error.

So my working hypothesis is that your model isn’t changing at all.
You get different results, of course, for different batches within an
epoch, but when you analyze the same batch in a subsequent
epoch, you get the same result.

The only thing that doesn’t fit with this explanation is that your loss
is going down systematically from batch to batch (within an epoch),
and your accuracy is going up. That makes it look like your model
is training. But I don’t see anything in your code that would reset
your model from one epoch to the next. It is hard for me to attribute
the decreasing loss to random luck, although it could be due to
some structure in the (unshuffled) order of samples in your training
set.

Step 1: Get rid of torch.round() (and any other zero-derivative
functions) leading up to your loss function.

Step 2: Why are you calling loss.requires_grad = True?
(And why isn’t it throwing an error? What version of pytorch are
you using?)

Good luck.

K. Frank

@KFrank, thank you for your kind help.

1. I use Rounding_output_c1 = torch.round(output_c1)
because I want my network’s output to be an integer,
and I use this code to calculate my training_acc:
labels_float = labels.float() # labels has no decimals; labels size is [4,224,224]
running_corrects = torch.sum(Rounding_output_c1 == labels_float).float()
labels_size = labels.size(1) * labels.size(2) * 4
training_acc = running_corrects / labels_size

And if I don’t use Rounding_output_c1 = torch.round(output_c1), training_acc will be 0.

2. I use loss.requires_grad = True because, when I debugged my code, I observed that loss’s requires_grad is False. If I don’t use this line, I get the following error:
Traceback (most recent call last):
File "F:/experiment_code/U-net/train_2.py", line 85, in <module>
loss.backward()
File "D:\Anaconda3\lib\site-packages\torch\tensor.py", line 107, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "D:\Anaconda3\lib\site-packages\torch\autograd\__init__.py", line 93, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

My PyTorch version is shown in the attached screenshot.

Hi fyy!

First, I don’t think the overall structure of what you are doing is
correct. More on that, below.

Some specific comments, in line:

Accuracy and loss are two different things. Accuracy is the count
of how many right answers you get, and doesn’t need to be
differentiable.

Loss is the quantity the optimizer attempts to minimize while
training, and does need to be differentiable in order for pytorch’s
gradient-descent optimization algorithms to work.

Of course, your loss function should have some relationship to
your accuracy for it to make sense, but they aren’t the same.

Even if it did make sense to use torch.round() to calculate
your accuracy (and I don’t think it does in your case), you can’t
use it for your loss function because it’s not usefully differentiable.

That RuntimeError means that something in your code is breaking
your autograd computation graph. (Although torch.round() won’t
work for your loss function, it is technically differentiable,
and I don’t think it breaks the computation graph.)

I’m not sure, but maybe this for loop breaks the graph?

Rounding_output_c11 = torch.stack([(Rounding_output_c1 == i).float() for i in range(256)]) #[4,224,224]->[256,4,224,224] 256 is the number of classes, means pixel from 0-255

Perhaps @ptrblck might be able to spot where the graph is getting
broken.

Some comments about your overall approach:

From this I speculate that you are performing multi-class image
segmentation, that is, that you are classifying each pixel in an
image, assigning them nClass different class labels.

The question is what is nClass, that is, what are the possible
values of each element of your [nBatch, 224, 224] labels
tensor?

You do say # 256 is the number of classes, but that seems
rather large. It is the number of possible values of an 8-bit pixel, but
do all such values really occur in your labels tensor?

Anyway, let’s call it nClass, and assume that each value in labels
is an integer class label in the range [0, nClass - 1], inclusive.

(I am assuming that nClass != 2, that is, that this is not a binary
segmentation / classification problem.)

For this you want to use CrossEntropyLoss, and structure your
network output to have shape [nBatch, nClass, 224, 224].
(These are your predictions and are the input to the loss function.)
Note that your number of channels (2) does not show up in the
shape of your output (nor in the shape of labels).

Your labels, again, have shape [nBatch, 224, 224] and are
the target passed to CrossEntropyLoss. (nClass enters into
labels not in its shape, but in the fact that the values in
labels range from 0 to nClass - 1.)
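
Here is a minimal sketch of those shapes (assuming nBatch = 4,
nClass = 2, and your 224 x 224 images):

import torch
import torch.nn as nn

logits = torch.randn(4, 2, 224, 224, requires_grad=True)  # predictions: [nBatch, nClass, H, W]
target = torch.randint(0, 2, (4, 224, 224))               # class indices in [0, nClass - 1]
loss = nn.CrossEntropyLoss()(logits, target)
loss.backward()   # gradients flow; no rounding, no one-hot needed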

If the above makes sense for your use case, then to calculate your
accuracy you would test input.argmax (dim = 1) for equality
against labels (the target you pass to CrossEntropyLoss),
and count the matches.
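
Continuing the sketch above:

preds = logits.argmax(dim = 1)                 # [nBatch, 224, 224] predicted class per pixel
accuracy = (preds == target).float().mean()    # fraction of pixels classified correctly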

Good luck.

K. Frank

Yes, I think you nailed it down.
Here is a small dummy example:

x = torch.randn(10, requires_grad=True)
y = torch.stack([(x == i).float() for i in range(256)])
print(y.requires_grad)
>  False
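
For contrast, a differentiable op on the same x keeps the graph
(assuming the usual import torch.nn.functional as F):

z = F.softmax(x, dim=0)
print(z.requires_grad)
>  True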

@KFrank explained the usage of nn.CrossEntropyLoss really well, so I don’t have anything to add. :wink:

@KFrank, thank you for your answer. Rounding_output_c11 = torch.stack([(Rounding_output_c1 == i).float() for i in range(256)]) truly does break the graph. I just wanted that code to one-hot encode output_c1 and change its shape; nClass is the pixel value from 0 to 255.
The value of each element of my [nBatch, 224, 224] labels tensor is from 0 to 255.
My network’s output is outputs, whose size is [4,2,224,224]; its values are from 0 to 255.
I want to use
output_c1 = outputs[:,0,:,:] # outputs [4,2,224,224] -> output_c1 [4,224,224]
to choose one channel to extract the foreground image.
If I want to use CrossEntropyLoss, I don’t know how to change my outputs’ shape and my labels’ shape to correctly calculate the loss. :sob:

@ptrblck, thank you for your help. From your small dummy example, I can see that Rounding_output_c11 = torch.stack([(Rounding_output_c1 == i).float() for i in range(256)]) breaks the graph. :blush:

Hi fyy!

I still don’t think that you’re going about this the right way. And I
doubt that you really have a 256-class classification problem.

I assume that the input to your model is some sort of image.
What is the conceptual meaning of such an image, and what
is its shape?

If you successfully train your model, what, at a high, conceptual
level, is your model supposed to tell you about an input image?

What do your labels mean? To be concrete, you say that your
batch-of-labels tensor has shape [4, 224, 224]. What does
the value of a specific element of that tensor mean? That is,
labels[0, 17, 128] is some number. Conceptually, what
is that number telling us?

Best.

K. Frank

@KFrank,
Hi, KFrank!
My dataset is fundus images. I am trying to use a U-net model to perform medical image segmentation.

        for step, data in enumerate(trainloader):
            imgs, labels, _, _ = data
            model.train()
            outputs = model(imgs)   # B * C * H * W  outputs [4,2,224,224]
            output_c1 = outputs[:,1,:,:]  # output_c1 [4,224,224]
            labels_show = labels.cpu().detach().numpy().astype(np.uint8)
            img_show = imgs.cpu().detach().numpy().astype(np.uint8)
            plt.figure()
            # one row per sample in the batch: label on the left, image on the right
            for i in range(4):
                plt.subplot(4, 2, 2 * i + 1)
                plt.imshow(labels_show[i, :, :]); plt.axis('off')
                plt.subplot(4, 2, 2 * i + 2)
                plt.imshow(img_show[i, 1, :, :]); plt.axis('off')
            plt.pause(0.5)


The left column shows fundus vessel pictures, which are the labels; the right column shows fundus images, which are imgs.
labels information is in the attached screenshot: labels[0,17,128] tells us the pixel value at row 17, column 128 of the first picture in the batch.
imgs information, outputs information, and output_c1 information are in the other attached screenshots.

Hello fyy!

At this point, the best I can offer you is some general advice.

First: Independent of u-net or pytorch or machine learning, you
need to understand the problem you are trying to solve. You
should look at the data you are working with. Print out some
images and try to segment them by hand (without a computer).
Draw the segmentation in with a pencil, and see how well you
can do.

With all due respect, you replied:

    labels[0,17,128] tells us the pixel value at row 17,
    column 128 of the first picture in the batch

to my question:

    Conceptually, what is that number telling us?

Your data has meaning. It’s virtually impossible to do any kind
of worthwhile data analysis (machine learning or not) without
understanding your data.

Second: You need to learn the basics of pytorch before tackling a
more substantive problem. I suggest that you work through some of
the pytorch tutorials to learn the framework and tools. In particular
the Training a Classifier tutorial should be useful because your
segmentation problem is a kind of classification problem.

A hint: I’m convinced that you are working on a binary segmentation
(classification) problem. Even though you posted them in (false) color,
when I look at your sample labels images, I see black-and-white (that
is, binary, rather than grayscale) images.
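
If that's right, a hedged sketch of the label fix (assuming your
masks store background as 0 and vessel pixels as 255) is to
collapse them to two class indices before computing the loss:

labels = (labels > 127).long()   # maps {0, 255} -> {0, 1}: background vs. vessel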

Good luck.

K. Frank

@KFrank, thank you for your help.