Converting RGB masks to class-index masks for segmentation

I am trying to fine-tune the fcn_resnet101 segmentation model on my own dataset, and I am currently stuck at the step where I need to convert the RGB mask images into masks that contain the class index for each pixel.

My input mask is an RGB image with one color per class (i.e. black for background, blue for car).

I adapted the code I found in this post to the following:

def mask_to_class(self, mask):
    #target = torch.from_numpy(mask)
    target = mask
    h, w = target.shape[0], target.shape[1]
    masks = torch.empty(h, w, dtype=torch.long)
    # find every distinct color that appears in the mask
    colors = torch.unique(target.view(-1, target.size(2)), dim=0).numpy()
    #print("colors: " + str(colors))
    print("len(colors): " + str(len(colors)))
    target = target.permute(2, 0, 1).contiguous()  # HWC -> CHW
    # map each color tuple to a class index
    mapping = {tuple(c): t for c, t in zip(colors.tolist(), range(len(colors)))}
    #print("mapping: " + str(mapping))
    for k in mapping:
        print("k: " + str(k))
        # pixels where all three channels match this color
        idx = (target == torch.tensor(k, dtype=torch.uint8).unsqueeze(1).unsqueeze(2))
        validx = (idx.sum(0) == 3)
        masks[validx] = torch.tensor(mapping[k], dtype=torch.long)
    return masks

To be honest, I don’t fully understand the code above, but I assume from the post and the name of the function that the purpose of mask_to_class is to convert the RGB masks into masks that contain the class index instead of the RGB value.
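That is indeed the idea: the function builds a color-to-index mapping and writes the index wherever all three channels match. A tiny self-contained sketch of the same technique (the colors here are made up to match the black/blue example):

```python
import torch

# a 2x2 "RGB mask" with two colors: black (background) and blue (car)
rgb = torch.tensor([[[0, 0, 0], [0, 0, 255]],
                    [[0, 0, 255], [0, 0, 0]]], dtype=torch.uint8)

mapping = {(0, 0, 0): 0, (0, 0, 255): 1}  # color -> class index

target = rgb.permute(2, 0, 1)             # HWC -> CHW
out = torch.empty(2, 2, dtype=torch.long)
for color, cls in mapping.items():
    # a pixel belongs to the class only if all three channels match the color
    match = (target == torch.tensor(color, dtype=torch.uint8).view(3, 1, 1))
    out[match.sum(0) == 3] = cls

print(out)  # tensor([[0, 1], [1, 0]])
```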

I then call that function in the __getitem__ function below:

def __getitem__(self, index):
    image =[index])
    mask =[index])
    image, mask = self.transform(image, mask)
    mask = self.mask_to_class(mask)
    return image, mask

And my transform function looks like this:

def transform(self, image, mask):
    # Random horizontal flipping
    if random.random() > 0.5:
        image = TF.hflip(image)
        mask = TF.hflip(mask)
    # Random vertical flipping
    if random.random() > 0.5:
        image = TF.vflip(image)
        mask = TF.vflip(mask)
    # Transform to tensor
    image = TF.to_tensor(image)
    mask = TF.to_tensor(mask)
    # Normalize the image only (ImageNet stats); the mask keeps its raw values
    image = TF.normalize(image, [0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    return image, mask

But when I run the code I get this error:

  File "/home/info/backend/", line 142, in __getitem__
    mask = self.mask_to_class(mask)
  File "/home/info/backend/", line 128, in mask_to_class
    idx = (target==torch.tensor(k, dtype=torch.uint8).unsqueeze(1).unsqueeze(2))
RuntimeError: Expected object of scalar type Float but got scalar type Byte for argument #2 'other'
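For context, TF.to_tensor converts a PIL image to a float32 tensor scaled to [0, 1], while the comparison colors built in mask_to_class are uint8, and older PyTorch versions refuse to compare mismatched dtypes, which is exactly this error. A toy reproduction of the mismatch with one way around it (pure torch, made-up pixel values):

```python
import torch

# TF.to_tensor turns a blue pixel (0, 0, 255) into (0.0, 0.0, 1.0) float32
float_mask = torch.tensor([[[0.0, 0.0, 1.0]]]).permute(2, 0, 1)  # CHW, float32

color = torch.tensor((0, 0, 255), dtype=torch.uint8).view(3, 1, 1)

# recover integer RGB values before comparing against the uint8 color
byte_mask = (float_mask * 255).to(torch.uint8)
match = (byte_mask == color).sum(0) == 3
print(match)  # tensor([[True]])
```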

I appreciate your help on this.

Also, something weird that I noticed while trying to debug this is that when I printed out len(colors), I got the following, even though I was expecting len(colors) to be 2 since there are only two colors in the mask:

len(colors): 529
len(colors): 711
len(colors): 715
len(colors): 775

I think I was able to figure out how to fix this issue. Basically, I added these three lines to the mask_to_class function to convert my input image from PIL format to a NumPy array, and kept everything else the same as in this post:

mask = np.array(pilImage)  # convert from PIL Image to NumPy array
mask = mask[..., :3]       # remove the alpha values
target = torch.from_numpy(mask)

I then called the mask_to_class function in the transform function in place of the mask = TF.to_tensor(mask) line.
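The three lines can be checked end to end on toy data (pure NumPy/torch, no PIL needed for the illustration; the RGBA array here stands in for np.array(pilImage)):

```python
import numpy as np
import torch

# stand-in for np.array(pilImage): an RGBA mask with two colors plus alpha
rgba = np.zeros((2, 2, 4), dtype=np.uint8)
rgba[..., 3] = 255   # fully opaque alpha everywhere
rgba[0, 1, 2] = 255  # two blue "car" pixels
rgba[1, 0, 2] = 255

mask = rgba[..., :3]             # drop the alpha channel
target = torch.from_numpy(mask)  # uint8 tensor, HWC

# reshape (not view) because the alpha-sliced array is non-contiguous
colors = torch.unique(target.reshape(-1, 3), dim=0)
print(len(colors))  # 2, as expected
```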

After doing this I don’t get the error anymore and len(colors) is 2 as expected, but I do get this error:

input and target batch or spatial sizes don't match: target [1 x 1000 x 750], input [1 x 2 x 750 x 10

If anyone has an idea of how to deal with this, let me know.

It seems the spatial sizes are permuted in either the output or the target tensor.
What shapes do you expect as the height and width?
The current format would be [batch_size, channels, height, width], so the height and width are mixed in one of the mentioned tensors.
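A quick way to catch the mix-up is a shape check before computing the loss, using the sizes from the error message (random stand-in tensors; nn.CrossEntropyLoss expects output [batch, classes, H, W] against target [batch, H, W]):

```python
import torch

output = torch.randn(1, 2, 750, 1000)         # [batch, classes, H, W]
target = torch.randint(0, 2, (1, 1000, 750))  # [batch, H, W] -- transposed!

# the target's spatial dims must match the output's before the loss call
if output.shape[-2:] != target.shape[-2:]:
    print("spatial size mismatch:",
          tuple(output.shape[-2:]), "vs", tuple(target.shape[-2:]))
```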

Yes. After some debugging I realized that the paths containing the input and target images were out of order, so the original image wasn’t being paired with its corresponding mask. I sorted the lists, and so far I don’t see the error, but the training step is still running. Actually, it’s been running for a few hours now, which I guess is a good sign, but I was wondering how long fine-tuning is supposed to take with around 500 images in my dataset, a batch size of 2, 14 epochs, and running on a GPU?
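For reference, a minimal sketch of the pairing fix (the file names are hypothetical): sorting both lists the same way keeps each image aligned with its mask.

```python
# hypothetical directory listings; os.listdir returns files in arbitrary
# order, so sort both lists identically to keep image/mask pairs aligned
image_paths = sorted(["img_2.png", "img_10.png", "img_1.png"])
mask_paths = sorted(["img_2.png", "img_10.png", "img_1.png"])

assert image_paths == mask_paths  # same order -> index i pairs correctly
print(image_paths)  # ['img_1.png', 'img_10.png', 'img_2.png']
```

Note that a plain lexicographic sort puts img_10 before img_2; with zero-padded names (img_002.png) or a natural-sort key this still pairs correctly as long as both lists use the same naming scheme.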

The runtime depends on your setup (hardware as well as libraries) and on your model.

That being said, is a single epoch using 500 images taking “a couple of hours”?
If so, could you post the model architecture as well as the training loop, as this seems to be too slow.
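In the meantime, timing individual steps helps narrow down whether the model or the data loading is the bottleneck. A minimal sketch (the `timed` helper and the stand-in workload are made up for illustration; in the real loop you would wrap the forward/backward pass and the batch fetch separately):

```python
import time

def timed(fn, *args):
    """Return (result, seconds) for a single call -- a simple profiling helper."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# stand-in for one training step; replace with forward/backward/optimizer.step()
result, seconds = timed(sum, range(1_000_000))
print(f"step took {seconds:.4f}s")
```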

It took about 10 hours to complete, but I’m going to have to run it again because it doesn’t look like I calculated accuracy correctly. I was using the fcn_resnet101 segmentation model.

Could you check the GPU utilization via nvidia-smi please?
Do the 10 hours refer to a single epoch or the complete training, i.e. multiple epochs?