Multi-Class Cross Entropy Loss function implementation in PyTorch

Hi @ptrblck

The earlier mask corruption was caused entirely by the OpenCV resize. I tried the same operation without resizing: I converted the mask to a torch tensor after doing the class-to-index mapping, and back, and I could see the original mask without any issues or corruption. This would explain the trouble I had training the NN in some of the early test runs. I'm now only going to crop images to the required size and feed them into the NN.

Have you tried resizing your images with other frameworks, such as PIL or scikit-image?

To Q01:
If you use BCELoss, your target should have the same type as the output, so you would have to transform it into a FloatTensor. You should leave the ones and zeros as they are.
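
A minimal illustration of the dtype requirement (the shapes here are made up):

import torch
import torch.nn as nn

criterion = nn.BCELoss()
output = torch.sigmoid(torch.randn(1, 10, 4, 4))  # model output, values in [0, 1]
target = torch.randint(0, 2, (1, 10, 4, 4))       # ones and zeros, but a LongTensor

loss = criterion(output, target.float())          # cast to float; the values stay as they are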

To Q02:
No, if you use another loss function like CrossEntropyLoss you would have to get the indices of the classes as we have done before.
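
For completeness, a small sketch of recovering the indices from a one-hot target (again with made-up shapes):

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
output = torch.randn(1, 10, 4, 4)      # raw logits, one channel per class
one_hot = torch.zeros(1, 10, 4, 4)
one_hot[:, 3] = 1.                     # pretend every pixel belongs to class 3

target = torch.argmax(one_hot, dim=1)  # class indices of shape [1, 4, 4]
loss = criterion(output, target)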

No, but I’ll try it out.

Hi @ptrblck

I made the changes to my model, switched to using a 3-ch input and generating a 10-ch mask with BCELoss.

When I try to run the training loop, I get this error for a single batch. I also get exactly the same error when I use the U-Net model that you provided in the link.

2018-06-07 00:27:32 INFO     | train_unet:train_and_evaluate:182: Epoch 1/10
  0% 0/20 [00:00<?, ?it/s]/tool/python/conda/env/gis36/lib/python3.6/site-packages/torch/nn/functional.py:1474: UserWarning: Using a target size (torch.Size([1, 3, 256, 256])) that is different to the input size (torch.Size([1, 10, 256, 256])) is deprecated. Please ensure they have the same size.
  "Please ensure they have the same size.".format(target.size(), input.size()))

Traceback (most recent call last):
  File "/project/geospatial/application/cs230-sifd/source/main/train/train_unet.py", line 284, in <module>
    main()
  File "/project/geospatial/application/cs230-sifd/source/main/train/train_unet.py", line 280, in main
    restore_file=params.restore_file)
  File "/project/geospatial/application/cs230-sifd/source/main/train/train_unet.py", line 190, in train_and_evaluate
    params=params)
  File "/project/geospatial/application/cs230-sifd/source/main/train/train_unet.py", line 100, in train
    loss = loss_fn(output_batch, labels_batch)
  File "/tool/python/conda/env/gis36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/tool/python/conda/env/gis36/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 433, in forward
    reduce=self.reduce)
  File "/tool/python/conda/env/gis36/lib/python3.6/site-packages/torch/nn/functional.py", line 1477, in binary_cross_entropy
    "!= input nelement ({})".format(target.nelement(), input.nelement()))
ValueError: Target and input must have the same number of elements. target nelement (196608) != input nelement (655360)

Process finished with exit code 1

I'm guessing it is because I converted my images and masks to PIL images, which changed the shape of the mask from the original (10, 256, 256) to (3, 256, 256).

This is what I am doing towards the end of my dataset class:

        image = torch.from_numpy(image)
        image = tvf.to_pil_image(image)

        mask = torch.from_numpy(mask)
        mask = tvf.to_pil_image(mask)

        # Apply user-specified transforms to image and mask.
        if self.transform:
            image, mask = self._transform(image, mask, self.transform)

        # Each sample of the dataset is a dict {'image': image, 'mask': mask}.
        # The dataset takes an optional transform argument so that any
        # required processing can be applied to the sample.
        sample = {'image': image,
                  'mask' : mask}

        return sample

I guess this might be the problem.

I would suggest trying the approach from this post of using 10 binary target images.
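
One way to build such a target from a class-index mask is a scatter into a zero tensor (a quick sketch; index_mask stands in for your real mask):

import torch

num_classes = 10
index_mask = torch.randint(0, num_classes, (256, 256))  # stand-in for the real class-index mask

binary_target = torch.zeros(num_classes, 256, 256)
binary_target.scatter_(0, index_mask.unsqueeze(0), 1.)  # one binary image per class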

Hi @ptrblck

The BCELoss configuration with the 10-ch target mask is now set up correctly.

I've had to write my own numpy routines to do the random crop, so that all 10 channels of the mask get cropped. (I also briefly tried the skimage transforms, but those too changed a 10-ch mask into a 3-ch mask.)
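
For reference, this is roughly what the joint crop looks like (a simplified sketch of my routine, assuming channels-first arrays):

import numpy as np

def random_crop(image, mask, crop_h, crop_w):
    """Crop image (C1, H, W) and mask (C2, H, W) at the same random location."""
    _, h, w = image.shape
    top = np.random.randint(0, h - crop_h + 1)
    left = np.random.randint(0, w - crop_w + 1)
    image = image[:, top:top + crop_h, left:left + crop_w]
    mask = mask[:, top:top + crop_h, left:left + crop_w]
    return image, mask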

For a sample training batch of 1, this is the input image:

With the BCE setup, the targets get allocated correctly: only ch3, ch4 and ch5 have content, and the rest of the channels are empty (black).

ch3

ch4

ch5

I just ran a small run of 20 epochs to see what is being picked up. (I know I have to run it for around 300 to 500 epochs to get it working correctly, but I thought I'd first check whether things are set up correctly.)

So, during test, for this input:

the target masks are only for ch4 (trees), the rest are all blank:

but there are no predictions in ch4; all of them are blank, except for some faint activations in ch1 and ch3.

This is the case for the other test images.

ch1 is for structures and ch3 is for track; the trees channel is empty.

Why do you think this is, i.e. why are the activations landing in the wrong channel/class for the test images?

I don't know if we can speculate about wrong channel ordering based on this run.
Are your training predictions looking correct?
This would be the first thing to check before evaluating.

If you’ve made sure the training output is correct, you could have a look at your validation set.

At the beginning of training the channels might sometimes predict weird things, but this should stabilize pretty fast.

Is there some way to get data augmentation done by controlling the number of batches and using random crops?

The original dataset contains 3k x 3k images:

  • train: 20 images
  • dev: 2 images
  • test: 3 images

What I would like to do is to augment these so that I get

  • 50000 train images via random crop (the dataset transform only does cropping at the moment, but I will add horizontal and vertical flips shortly)
  • 10000 dev images
  • 10000 test images

I have a params.yaml file that controls the number of batches for each dataset. Is there some combination of epochs and batch sizes I can use to ensure that the NN model actually sees 50000 train images and 10000 dev images, via in-line transform data augmentation in the dataset class?

parameters:

  # model parameters
  model: u-net
  description: 3-ch input, 10-ch output BCELoss configuration.
  in_channels: 3
  out_channels: 10

  # general parameters
  learning_rate: 1e-3
  num_epochs: 300
  save_summary_steps: 1

  # dataset parameters
  dataset:

    # dataset pre-processing parameters
    preprocessing:

      # image preprocessing
      image:

        # resize
        resize:
          height: 3328
          width:  3328
          interpolation:

        # crop
        crop:
          height: 512
          width: 512

        # align
        align: None

  # training parameters
  train:
    batch_size: 30
    num_workers: 4

  # training parameters
  valid:
    batch_size: 6
    num_workers: 4

  # test parameters
  test:
    batch_size: 1
    num_workers: 4

If your data augmentation is in the __getitem__ method of your Dataset, it’s being applied on the fly, i.e. each iteration produces a new transformed batch of samples.

Since you just have 20 training images, your batch size can be set to 20 as well.
To see 50000 training images, you would need 2500 epochs (2500*20=50000).
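
Alternatively (just a sketch, assuming channels-first numpy arrays; adapt as needed), you could let the Dataset report a larger virtual length, so that a single epoch already yields the 50000 random crops:

import numpy as np
import torch
from torch.utils.data import Dataset

class AugmentedDataset(Dataset):
    """Hypothetical wrapper: cycles over the base images, cropping on the fly."""
    def __init__(self, images, masks, virtual_length=50000, crop=512):
        self.images = images        # list of (C, H, W) arrays
        self.masks = masks
        self.virtual_length = virtual_length
        self.crop = crop

    def __len__(self):
        return self.virtual_length  # one epoch now yields 50000 samples

    def __getitem__(self, idx):
        i = idx % len(self.images)  # map the virtual index to a base image
        _, h, w = self.images[i].shape
        top = np.random.randint(0, h - self.crop + 1)
        left = np.random.randint(0, w - self.crop + 1)
        sl = (slice(None), slice(top, top + self.crop), slice(left, left + self.crop))
        return {'image': torch.from_numpy(self.images[i][sl].copy()),
                'mask': torch.from_numpy(self.masks[i][sl].copy())}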

Your dataset is quite small, so you have to be careful about overfitting.
Also, a pre-trained model might help in this case.

Most people in the DSTL competition also worked with this small dataset but resorted to data augmentation and got good results, albeit using an ensemble of individually trained U-Net models, one per class.

In my case, I'm trying to generalize using a single model. After I get the single-model U-Net infrastructure up and running with some reasonable results, I will switch models and try a capsule-based one, following this paper: Capsules for Object Segmentation - Rodney LaLonde, Ulas Bagci - 2018. To quote:

SegCaps reduced the number of parameters of U-Net architecture by 95.4% while still providing a better segmentation accuracy.

If it hadn't been for PIL's limitation of handling only 3-ch images, I would have been able to create an offline cached dataset easily.

I will now probably have to write a separate routine that transforms the individual channels of an image and mask separately, concatenates them back into a multi-channel image, and saves it to disk.
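
Something along these lines is what I have in mind (an untested sketch; it assumes uint8 channels, and a random transform would need to reuse the same parameters for every channel so the channels stay aligned):

import numpy as np
from PIL import Image

def transform_per_channel(array, pil_transform):
    """Apply a PIL-based transform to each (H, W) channel of a (C, H, W) array."""
    channels = []
    for ch in array:
        pil_ch = Image.fromarray(ch)   # single-channel PIL image (mode 'L')
        pil_ch = pil_transform(pil_ch)
        channels.append(np.asarray(pil_ch))
    return np.stack(channels, axis=0)  # concatenate back to (C, H, W)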

My losses are going down but the accuracy is still zero. This can’t be right.

2018-06-07 16:01:47 INFO     | train_unet:train:147: - Train metrics: accuracy: 0.000000 ; loss: 0.166378
2018-06-07 16:01:49 DEBUG    | evaluate_unet:evaluate:106: eval loss: 0.285360187292099
2018-06-07 16:01:49 DEBUG    | evaluate_unet:evaluate:125: eval accuracy: 0.000000
2018-06-07 16:01:49 INFO     | evaluate_unet:evaluate:137: - Eval metrics : accuracy: 0.000000 ; loss: 0.285360
2018-06-07 16:01:49 INFO     | train_unet:train_and_evaluate:205: - Found new best accuracy
2018-06-07 16:01:49 INFO     | train_unet:train_and_evaluate:176: Epoch 45/2500

If I try over-fitting the model with just 1 sample for both train and eval, the accuracy goes up, and I can see the predictions taking shape in some of the channels.

What do you think?

You successfully overfitted using one sample, but the accuracy is zero if you use all 20 samples?
Is the training loss decreasing when using all samples?
What do the training predictions look like? Are they all black, or do they start to predict something?

When I overfit the model with one sample, the accuracy goes up.

When training over the 20 train, 2 dev and 3 test images, the accuracy doesn't go up.

This is my accuracy function. Does it look okay?

def accuracy(outputs, labels):
    """
    Compute the accuracy, given the outputs and labels for all images.

    Args:
        outputs: (np.ndarray) log softmax output of the model
        labels: (np.ndarray) labels

    Returns: (float) accuracy in [0,1]
    """
    outputs = np.argmax(outputs, axis=1)
    return np.sum(outputs == labels)/float(labels.size)

On the bright side, the NN seems to be predicting things a bit more correctly after training for 50 epochs. It is starting to correctly classify trees, crops and similar-looking classes for the test inputs.

For this test sample:

2018-06-07 16:21:59 DEBUG    | visualize:display_mask:55: mask shape: (512, 512, 10)
2018-06-07 16:22:00 DEBUG    | evaluate_unet:evaluate:106: eval loss: 0.1992570161819458
2018-06-07 16:22:00 DEBUG    | visualize:display_mask:55: mask shape: (512, 512, 10)
2018-06-07 16:22:02 DEBUG    | evaluate_unet:evaluate:125: eval accuracy: 0.00000343
2018-06-07 16:22:02 INFO     | evaluate_unet:evaluate:137: - Eval metrics : accuracy: 0.00000343 ; loss: 0.19925702

Q01: The loss is quite low; it shouldn't be so low, right?

Q02: The predicted mask values are floats, so there are some activations, but shouldn't the loss be computed on whether the activation is a binary 1, or really close to 1, in the predicted mask, instead of some small floating point value? Is the BCELoss function working correctly in this case?

If you are using BCELoss, you have to use a sigmoid layer for your model output, not log_softmax!
Also, since you are now creating a binary prediction for each output channel, you also shouldn’t use argmax anymore, but compare the predictions channel-wise:

import torch
import torch.nn.functional as F

x = F.sigmoid(torch.randn(10, 5, 12, 12))
y = torch.empty(10, 5, 12, 12).random_(2)

threshold = 0.5
x = x > threshold
(x.float() == y).float().sum() / float(y.nelement())

I'm using a U-Net model.

    def forward(self, x):
        # Encoder
        x = self.init_conv(x)
        x1 = self.down1(x)
        x2 = self.down2(x1)
        x3 = self.down3(x2)
        # Decoder
        x_up = self.up3(x3, x2)
        x_up = self.up2(x_up, x1)
        x_up = self.up1(x_up, x)
        x_out = F.sigmoid(self.out(x_up))
        return x_out

There is a sigmoid layer at the final output. Is this correct, or do I have to make a change somewhere else?

It's correct. Your accuracy function's docstring, however, states:

outputs: (np.ndarray) log softmax output of the model

That’s why I was wondering.

Sorry, that was an error in the doc for that function.

Ok, no worries. Have you applied a threshold before computing the accuracy?

I’ve modified my accuracy function to use a threshold of 0.95.

def accuracy(outputs, labels):
    """
    Compute the accuracy, given the outputs and labels for all images.

    Args:
        outputs: (torch.Tensor) sigmoid output of the model
        labels: (torch.Tensor) labels

    Returns: (float) accuracy in [0,1]
    """
    threshold = 0.95
    outputs = outputs > threshold
    return (outputs.float() == labels).float().sum() / float(labels.nelement())

During evaluation of the model in its current state, it still gives a very high accuracy of 0.82 with a loss of 0.31, even though the threshold is set to 0.95.

This is the input image during prediction, along with the ch3 and ch4 target masks.

These are the predicted masks for ch3 and ch4:

As you can see in the predictions for ch3 and ch4, only the ch4 prediction is correct; ch3 has a ghost image of ch4. ch3 is a track and ch4 is trees, and perhaps this could be improved by using a reflectance index that is good at detecting vegetation, adding it as an extra input layer to the model so that it can detect trees and vegetation better.

However, the accuracy should still not be this high, given its current level of performance.
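
Thinking about it some more, the metric itself may explain part of this: with mostly empty target channels, pixel accuracy is dominated by true negatives. A quick check (blob size made up):

import torch

# 10 channels, with only a small 50x50 blob of positives in one channel
target = torch.zeros(10, 512, 512)
target[4, 100:150, 100:150] = 1.

prediction = torch.zeros(10, 512, 512)  # an entirely blank prediction

accuracy = (prediction == target).float().sum() / float(target.nelement())
print(accuracy)  # ~0.999, despite predicting nothing at all

A per-channel IoU or Dice score would probably reflect the actual segmentation quality better.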