Multi-Class Cross Entropy Loss function implementation in PyTorch

edowson · June 7, 2018, 5:17pm

I came across an implementation of a BCEDiceLoss function in PyTorch, by Jeff Wen for a binary segmentation problem using a different dataset and U-net.

github:jeffwen/road_building_extraction - metrics.py
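For readers unfamiliar with the idea, a combined BCE and Dice loss can be sketched roughly as follows. This is a minimal sketch and not necessarily what the linked metrics.py does: the class body, the `eps` smoothing term, and the choice to add BCE to (1 - Dice) are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class BCEDiceLoss(nn.Module):
    """Hypothetical sketch: binary cross entropy plus (1 - soft Dice)."""

    def __init__(self, eps=1.0):
        super().__init__()
        self.bce = nn.BCEWithLogitsLoss()  # expects raw logits, not probabilities
        self.eps = eps  # smoothing term, avoids division by zero on empty masks

    def forward(self, logits, target):
        bce = self.bce(logits, target)
        probs = torch.sigmoid(logits)
        # Flatten each sample so Dice is computed per image, then averaged.
        probs = probs.reshape(probs.size(0), -1)
        target = target.reshape(target.size(0), -1)
        intersection = (probs * target).sum(dim=1)
        dice = (2.0 * intersection + self.eps) / (
            probs.sum(dim=1) + target.sum(dim=1) + self.eps
        )
        return bce + (1.0 - dice.mean())


torch.manual_seed(0)
logits = torch.randn(3, 1, 32, 32, requires_grad=True)  # batch of 3, 1-channel output
target = (torch.rand(3, 1, 32, 32) > 0.5).float()       # binary mask as float
loss = BCEDiceLoss()(logits, target)
loss.backward()
```

The loss takes raw logits, so the model's last layer should not apply a sigmoid when this sketch is used.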

The losses and eval metrics look a lot better now, given the low performance of the NN at 50 epochs. I’ve read that it takes between 300 and 500 epochs to get meaningful results.

100% 1/1 [00:16<00:00, 16.51s/it, loss=1.553]
2018-06-07 21:22:25 INFO     | train_unet:train:145: - Train metrics: dice_coeff: 0.13889287 ; jaccard_index: 0.00000000 ; loss: 1.55340451
2018-06-07 21:22:28 INFO     | evaluate_unet:evaluate:122: - Eval metrics : dice_coeff: 0.16866308 ; jaccard_index: 0.00000000 ; loss: 1.52735714
2018-06-07 21:22:28 INFO     | train_unet:train_and_evaluate:203: - Found new best accuracy
2018-06-07 21:22:28 INFO     | train_unet:train_and_evaluate:174: Epoch 52/500

edowson · June 8, 2018, 1:18pm

Hi @ptrblck

I ran your U-Net model for around 3300 epochs, but it wasn’t giving good results, and it only learnt how to detect structures (ch1), with the rest being mostly gray or black.

I decided to try and debug it with a single class, buildings (ch0). I reconfigured the model with a 3-channel input and a 1-channel output, using a combined binary cross entropy and dice loss function. I trained it for 100 epochs with a batch size of 3, but the ch0 predictions came up empty. I was expecting at least something, but they were totally blank.

Here is an input image:

Here is the corresponding mask ch0 buildings:

And here are the predictions for ch0, which are mostly empty:

I then switched to using a U-Net model from here github:minerva-ml/steppy-toolkit - U-Net

I’m seeing some activations for another test image in ch0 at 20 epochs, but the predictions don’t improve beyond this.

edowson · June 9, 2018, 10:29am

Hi @ptrblck

I finally found out what was going wrong. It wasn’t NN model related.

In the end, it turned out that there might have been three issues.

With torch you have data conversion from numpy to a torch tensor, and data movement from cpu to gpu.

When computing the metrics (loss, accuracy), one has to take care to do so consistently in either NumPy or torch. The CS230 template initially computed them in NumPy (after moving the tensor off the GPU with tensor.cpu() and converting it with tensor.numpy()), and midway I switched to a Torch version of the BCEDiceLoss and Dice coefficient. Because my train and eval loops mixed the two, the metrics were off and the NN couldn’t learn anything.

NumPy image arrays are conventionally stored in HxWxC order, whereas PyTorch expects CxHxW.
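The layout difference can be illustrated with a small round trip. This is a sketch under assumed shapes (a random 64x48 RGB array); the variable names are made up for the example.

```python
import numpy as np
import torch

image_hwc = np.random.rand(64, 48, 3).astype(np.float32)  # H x W x C

# NumPy -> torch: move the channel axis to the front.
tensor_chw = torch.from_numpy(image_hwc).permute(2, 0, 1).contiguous()

# torch -> NumPy: detach, move to CPU, and restore the channel-last layout.
back_hwc = tensor_chw.detach().cpu().permute(1, 2, 0).numpy()
```

If a metric is computed in NumPy on one side and in torch on the other, both inputs need to be in the same layout before they are compared.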

In the original code, when working with the GPU, the train and labels batch are not explicitly cast to Torch Variables.

Q01: Would this have had any impact?

from torch.autograd import Variable
from tqdm import tqdm

# Use tqdm for progress bar
with tqdm(total=len(dataloader)) as t:
    for i, (train_batch, labels_batch) in enumerate(dataloader):
        # move to GPU if available
        # (`async` is a reserved word since Python 3.7; PyTorch 0.4+ uses `non_blocking`)
        if params.cuda:
            train_batch, labels_batch = train_batch.cuda(non_blocking=True), labels_batch.cuda(non_blocking=True)
        # convert to torch Variables
        train_batch, labels_batch = Variable(train_batch), Variable(labels_batch)

I modified it as follows, casting both the GPU case and the CPU case to use torch variables.

# iterate over the data, use tqdm for progress bar
with tqdm(total=len(dataloader), desc="training") as t:
    for i, samples_batch in enumerate(dataloader):

        # extract data and labels batch
        train_batch = samples_batch['image']
        labels_batch = samples_batch['mask']

        # convert to torch variables, move to GPU if available
        # (`non_blocking` replaces the old `async` keyword argument)
        if params.cuda:
            train_batch = Variable(train_batch.cuda(non_blocking=True))
            labels_batch = Variable(labels_batch.cuda(non_blocking=True))
        else:
            train_batch = Variable(train_batch)
            labels_batch = Variable(labels_batch)

When I used the model with an LR scheduler, I forgot to update the step for the LR scheduler, thinking that calling step on the scheduler would automatically call step on the optimizer. This could have been another reason why training didn’t progress further.
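The optimizer and the scheduler are stepped independently: neither call triggers the other. A minimal sketch, assuming a toy linear model and a StepLR schedule chosen only for illustration (the call order shown is the one expected by PyTorch 1.1.0 and later):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.5)

x, y = torch.randn(8, 4), torch.randn(8, 1)
weights_before = model.weight.detach().clone()

for epoch in range(2):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()   # updates the weights
    scheduler.step()   # only updates the learning rate

lr_after = optimizer.param_groups[0]["lr"]
```

Leaving out optimizer.step() keeps the weights frozen, while leaving out scheduler.step() keeps the learning rate at its initial value.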

The NN is beginning to pick out the buildings now.

These are the predictions during train:

tensorboard-step-68

These are the predictions during validation.

tensorboard-step-77

ptrblck · June 9, 2018, 11:27am

In both of your training loops the tensors are wrapped by Variables. So this shouldn’t have any impact on the result.

Nice, it’s working now! The predictions look better than before

edowson · June 9, 2018, 11:43am

The images and mask were already converted to Tensors by the Dataset, as part of its internal transforms.

I read in the docs that tensors are wrapped as variables by default? So does this mean that I can remove the calls that wrap these tensors as variables, since autograd is enabled by default for tensors?

edowson · June 9, 2018, 11:46am

It wouldn’t have been possible without your help! Thank you for your patience! I’ve taken my first steps in learning how to work with PyTorch.

ptrblck · June 9, 2018, 11:56am

In the latest stable release (version 0.4.0) Variables and tensors were merged. So you can just skip the wrapping. You can check your version with

print(torch.__version__)

If you have an older version, have a look at the website for the install instructions.
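As a small illustration for version 0.4.0 and later, a plain tensor can take part in autograd without any wrapping. Note that gradient tracking is opt-in through requires_grad (model parameters have it set already), so input and label batches normally do not track gradients. The tensors below are made up for the example.

```python
import torch

x = torch.ones(2, 3)                      # plain tensor, no gradient tracking
w = torch.ones(3, 1, requires_grad=True)  # opt in to autograd explicitly

out = (x @ w).sum()
out.backward()  # fills w.grad; no Variable wrapper involved
```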

You are very welcome!

edowson · June 9, 2018, 12:00pm

I’m using v0.4.0. I’m looking forward to the 0.5.0 release! One thing that stands out is PyTorch’s ability to distribute a workload across multiple GPUs.

I glanced through the advanced tutorial on message passing between multiple machines, but haven’t had a chance to try that out. I think I’ll try to configure a setup where one machine prepares the data, primarily using its CPU, and serves it to another machine with multiple GPUs, as an experiment.

ptrblck · June 9, 2018, 12:05pm

You could try to use multiple GPUs in one machine using nn.DataParallel, which is quite easy.
I haven’t worked with distributed machines yet.
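A minimal sketch of the nn.DataParallel suggestion, assuming a tiny stand-in network rather than the U-Net from this thread. The wrapper is only applied when more than one GPU is visible, so the same code also runs on a single GPU or on the CPU.

```python
import torch
import torch.nn as nn

# Stand-in model: 3-channel input, 1-channel output.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 1, 1))
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

if torch.cuda.device_count() > 1:
    # Splits each input batch along dim 0 across the visible GPUs.
    model = nn.DataParallel(model)
model = model.to(device)

batch = torch.randn(4, 3, 32, 32, device=device)
output = model(batch)
```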

I think the next release will be 1.0. Looking forward to it!

Preet_Khaturia · January 4, 2019, 12:00pm

Hi @ptrblck, can I apply softmax to the prediction (from the model)? I am using dice loss and my targets are binary masks for multiple classes.
Do I also have to apply sigmoid?

ptrblck · January 4, 2019, 4:35pm

Does this mean that each pixel belongs to exactly one class, or could one pixel be assigned to more than one class?
In the former case, you could use softmax, which normalizes the logits into probabilities that sum to 1 over the classes for each pixel, while you could probably apply sigmoid in the latter case.
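The difference between the two cases can be sketched on random logits. The shape [batch, classes, height, width] is an assumption chosen for the example.

```python
import torch

torch.manual_seed(0)
logits = torch.randn(2, 4, 8, 8)  # [batch, classes, height, width]

# Exclusive classes: softmax over the class dimension, sums to 1 per pixel.
probs_softmax = torch.softmax(logits, dim=1)

# Overlapping classes: independent sigmoid per class channel, no sum constraint.
probs_sigmoid = torch.sigmoid(logits)
```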

Preet_Khaturia · January 7, 2019, 5:22am

Thanks, mine is the former case.

Preet_Khaturia · January 7, 2019, 9:41am

Hello @edowson,
I have a similar dataset and want to normalize the images between [0,1].

mean_img = []
std_img = []
for i in range(4):
    # per-channel statistics, channel is the last axis
    mean_img.append(train_images[:, :, :, i].mean())
    std_img.append(train_images[:, :, :, i].std())

But I am not getting the result between [0,1] using transforms.Normalize(). Any help?
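One possible source of the confusion: transforms.Normalize computes (x - mean) / std per channel, which standardizes the data to roughly zero mean and unit variance and does not map it into [0, 1]. The sketch below reproduces both operations in plain torch on made-up data in the same N x H x W x C layout as the snippet above; the min-max rescaling is an assumption about what was intended.

```python
import torch

torch.manual_seed(0)
train_images = torch.rand(10, 16, 16, 4) * 255.0  # hypothetical N x H x W x C data

# Per-channel statistics over every axis except the channel axis.
mean = train_images.mean(dim=(0, 1, 2))
std = train_images.std(dim=(0, 1, 2))
standardized = (train_images - mean) / std  # same arithmetic as Normalize, per channel

# Min-max scaling into [0, 1], also per channel.
lo = train_images.amin(dim=(0, 1, 2))
hi = train_images.amax(dim=(0, 1, 2))
rescaled = (train_images - lo) / (hi - lo)
```

The standardized tensor contains negative values, whereas the rescaled one stays within [0, 1].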

Justin_Brown · January 26, 2019, 8:37pm

I’m having the same problem. The TensorFlow version of this function (tf.losses.softmax_cross_entropy) gladly accepts multiclass labels. I’m doing image-to-image conversion and need to be able to have multiple channels in my ground truth image. I’m going from [b, 3, h, w] to [b, 3, h, w]. Is there a way to do this, or are there any plans to update the cross entropy loss function?
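One way to handle a multi-channel target is to write the cross entropy out by hand against per-pixel class probabilities. This is a sketch under the assumption that each pixel's target channels form a probability distribution over the classes; the shapes and names are made up for the example.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
b, c, h, w = 2, 3, 8, 8
output = torch.randn(b, c, h, w, requires_grad=True)    # raw logits
target = torch.softmax(torch.randn(b, c, h, w), dim=1)  # per-pixel class probabilities

# Cross entropy against a probability target: sum over classes, mean over pixels.
manual_loss = -(target * F.log_softmax(output, dim=1)).sum(dim=1).mean()
manual_loss.backward()
```

With a one-hot target this reduces to the usual class-index cross entropy. The January 2022 example further down the thread also passes a floating-point one-hot target directly to nn.CrossEntropyLoss.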

Chloe_Su · January 25, 2022, 2:30pm

Hi sir,
Can I ask a related but somewhat different question here?
For CrossEntropyLoss in torch 1.10, it’s supposed to work for an input of size [4, 20] and a target of size [4], right?
But mine says it expected the target to have shape [4, 20] too.
I really don’t want to incur the extra step of building one-hot vectors here.

ptrblck · January 25, 2022, 7:10pm

Maybe the dtype of the target is wrong and nn.CrossEntropyLoss assumes you are trying to pass probabilities to the criterion instead of class labels.
Here is a small example of which shapes work:

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

batch_size = 16
nb_classes = 10
h, w = 24, 24

output = torch.randn(batch_size, nb_classes, h, w, requires_grad=True)
target = torch.randint(0, nb_classes, (batch_size, h, w))

loss = criterion(output, target) # works

# The next two calls raise; comment them out to run the remaining lines.
loss = criterion(output, target.float())
# > RuntimeError: expected scalar type Long but found Float

# num_classes is passed explicitly so the one-hot depth always matches the output
one_hot = torch.nn.functional.one_hot(target, num_classes=nb_classes)
loss = criterion(output, one_hot.permute(0, 3, 1, 2))
# > RuntimeError: Expected floating point type for target with class probabilities, got Long

loss = criterion(output, one_hot.float().permute(0, 3, 1, 2)) # works

As you can see, the shapes you described should work if the target is a LongTensor containing class indices.
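Applied to the shapes from the question, a minimal sketch looks like this (random data; the float cast only simulates a target that arrives with the wrong dtype):

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
output = torch.randn(4, 20, requires_grad=True)  # [batch, classes]
target = torch.randint(0, 20, (4,))              # [batch], dtype torch.int64

loss = criterion(output, target)

# A float target of shape [4] would be treated as class probabilities;
# casting it back to long restores the class-index interpretation.
float_target = target.float()
loss_fixed = criterion(output, float_target.long())
```

No one-hot encoding is needed as long as the target keeps an integer (long) dtype.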