Only batches of spatial targets supported (non-empty 3D tensors) but got targets of size: : [1, 1, 256, 256]

Thanks so much for clarifying! Something else I don’t understand: if the class index for car is 7 in the pre-trained model, and I assign it the value 1 when I fine-tune the model, how does it make the connection that index 7 in the pre-trained model and index 1 in the fine-tuned version both refer to the class ‘car’? Basically, I want to understand how the pre-trained model’s knowledge of what a car looks like gets transferred when I fine-tune, if there is no connection between the original class ID and the one I make up.

By “finetune the model” I assume you are reinitializing the last layer?
Also, let’s assume you are freezing all preceding layers and just train the last new classification layer (nn.Conv2d with out_channels=2 in your case).

The “feature extractor” part of the model, so basically all layers before the final classifiers, are already trained on the COCO dataset and can successfully yield useful features for the classification of the dataset.
If you swap the last classification layer for a new one adapted to your use case, these incoming features will be used to train the classification layer so that it minimizes the loss for your segmentation use case.

The last layer won’t have any knowledge about the trained classes in the past, but the other layers might yield good features for your current classes.
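
As a minimal sketch of that setup, freezing the feature extractor and training only a new 2-class layer could look like the code below. Note that `TinySegModel`, `backbone`, and `classifier` are hypothetical stand-ins made up for illustration, not the actual DeepLabV3 architecture.

```python
import torch
import torch.nn as nn


# Hypothetical stand-in for a pretrained segmentation model (assumption for illustration)
class TinySegModel(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        # "Feature extractor" part
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Final per-pixel classification layer
        self.classifier = nn.Conv2d(16, num_classes, kernel_size=1)

    def forward(self, x):
        return self.classifier(self.backbone(x))


model = TinySegModel(num_classes=21)  # pretend this was pretrained on 21 classes

# Freeze all existing parameters
for param in model.parameters():
    param.requires_grad = False

# Replace the last layer; new modules have requires_grad=True by default
model.classifier = nn.Conv2d(16, 2, kernel_size=1)  # background + car

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=0.01
)

out = model(torch.randn(1, 3, 32, 32))  # logits of shape [1, 2, 32, 32]
```

The new layer starts from random weights, so the index you pick for ‘car’ is arbitrary; only the frozen features are reused. For torchvision’s DeepLabV3 the layer to replace should be `model.classifier[4]`, but please verify that against your torchvision version.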

Side note, unrelated to your original question:
Note that this approach might still work, if you change the data domain, e.g. if you would like to segment organs in a CT scan now.
In such a case, I would recommend to try to finetune the complete model, if you have enough data, as the features from the first layers might not be suitable for the CT images.

Thanks for the explanation. So when the feature_extract parameter from this tutorial is set to True, only the last layer is updated, but the other layers still contain information that might be useful.

So would you say that if you have a small dataset, setting the feature_extract parameter to True would generally yield better results? Specifically if your classes come from the original domain the pre-trained model was trained on, like in my case where I have car and background?

This would be my best guess, yes.
If your dataset is small and from the same domain, I would first try to only retrain the classifier and freeze all other layers.
However, a quick test retraining the whole model might yield other results, so it would be interesting what works better for your use case. :wink:

Sounds good! I was planning on trying it both ways to see what yields better results, but was curious about what was recommended.

Thanks for all your help :slight_smile:

Hi @ptrblck ,

I actually have a few questions about the color mapping example you have from this post.

In the code below you are just creating a dummy mask image? So since I already have the masked RGB image I wouldn’t need to have the code below, correct?

import numpy as np
import matplotlib.pyplot as plt

# Create dummy target image
nb_classes = 19 - 1 # 18 classes + background
idx = np.linspace(0., 1., nb_classes)
cmap = plt.get_cmap('viridis')
rgb = cmap(idx, bytes=True)[:, :3]  # Remove alpha value

h, w = 190, 100
rgb = rgb.repeat(1000, 0)
target = np.zeros((h*w, 3), dtype=np.uint8)
target[:rgb.shape[0]] = rgb
target = target.reshape(h, w, 3)

This is what I have so far for the function that I am going to include in the MyDataset class:

def convertTargetToMatrix(self, target):
    h = 1000
    w = 750

    # Dictionary where the key is the class id and the value is the color in the mask
    mapping = {}
    mapping[0] = (0, 0, 0)  # Class 0 = background
    mapping[1] = (128, 128, 128)  # Class 1 = car

    # Empty mask to be filled in below
    mask = torch.empty(h, w, dtype=torch.long)

    # TODO: Change each RGB value in the color mask to its corresponding class index
    # Pseudo code below:
    for y in range(h):
        for x in range(w):
            rgbValue = getRGBOfTargetAtXY(x, y)
            if rgbValue == mapping[0]:  # pixel is background
                mask[y][x] = 0
            elif rgbValue == mapping[1]:  # pixel is car
                mask[y][x] = 1

    return mask

Can you let me know if I am going in the right direction or if I am doing something wrong here? The pseudo code I wrote is how I conceptually understand what’s happening (i.e. replacing RGB values with class ids), but if there is a specific function in torch or torchvision that does this better and more efficiently, please let me know. I don’t know if the code below is doing that, as I don’t really understand it.

for k in mapping:
    # Get all indices for current class
    idx = (target==torch.tensor(k, dtype=torch.uint8).unsqueeze(1).unsqueeze(2))
    validx = (idx.sum(0) == 3)  # Check that all channels match
    mask[validx] = torch.tensor(mapping[k], dtype=torch.long)

Specifically, I don’t understand what the line below is doing:

idx = (target==torch.tensor(k, dtype=torch.uint8).unsqueeze(1).unsqueeze(2))

I have 4 masks per image and each mask is a binary image containing 0s and 1s. I have concatenated them into a tensor [batch, 4, 224, 224]. But I’m getting the error mentioned in the topic. Is there a different way that I need to stack these masks ?

Are you dealing with a multi-class or multi-label segmentation use case?
In the former case, each pixel would belong to a single class only, while each pixel might belong to zero, one, or more classes in the latter case.

For a multi-class segmentation, you would most likely use nn.CrossEntropyLoss and your target is expected to be a LongTensor containing the class indices in the range [0, nb_classes-1] in the shape [batch_size, height, width].
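
A small sketch of these shapes, using random tensors with made-up sizes in place of a real model output and mask. A target that carries an extra channel dimension, such as [batch_size, 1, height, width], can be brought into the expected layout with `squeeze(1)`.

```python
import torch
import torch.nn as nn

# Made-up sizes for illustration
batch_size, nb_classes, h, w = 2, 4, 8, 8
criterion = nn.CrossEntropyLoss()

# Model output: raw logits of shape [batch_size, nb_classes, height, width]
output = torch.randn(batch_size, nb_classes, h, w)

# Target with an extra channel dimension: [batch_size, 1, height, width]
target = torch.randint(0, nb_classes, (batch_size, 1, h, w))

# Remove the channel dimension so the target is [batch_size, height, width]
target = target.squeeze(1)

loss = criterion(output, target)  # scalar tensor
```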

Currently your target seems to be a one-hot encoded (or multi-hot encoded) target.
Let me know, what kind of use case you are working on and if you need more information. :slight_smile:

Sorry for the late reply.

Yes, in my code snippet I’m using a randomly initialized mask, so you should use your mask.
I’m not aware of a fast built-in method to create this mapping, but I’m sure that my code is not the most efficient one. :wink:

The line of code gets the indices for the current class index by comparing the target image to the current class mapping (color code).
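
To make the broadcasting more explicit, here is a sketch of the same idea wrapped in a function. The name `rgb_to_class_index` and the choice to default unmatched pixels to class 0 are assumptions for illustration; the colors are the ones from the question. The target is assumed to be a uint8 tensor of shape [3, H, W].

```python
import torch

# Mapping from RGB color to class index (colors taken from the question above)
mapping = {(0, 0, 0): 0, (128, 128, 128): 1}


def rgb_to_class_index(target, mapping):
    """Convert a uint8 RGB mask of shape [3, H, W] to a LongTensor of shape [H, W]."""
    _, h, w = target.shape
    # Assumption: pixels whose color is not in the mapping stay at class 0
    mask = torch.zeros(h, w, dtype=torch.long)
    for color, class_idx in mapping.items():
        # Shape [3] -> [3, 1, 1], so the color broadcasts against [3, H, W]
        color_t = torch.tensor(color, dtype=torch.uint8).unsqueeze(1).unsqueeze(2)
        idx = target == color_t     # bool [3, H, W]: per-channel match
        validx = idx.sum(0) == 3    # bool [H, W]: all three channels match
        mask[validx] = class_idx
    return mask


# Usage: a 4x4 background image with a single "car" pixel
target = torch.zeros(3, 4, 4, dtype=torch.uint8)
target[:, 1, 2] = 128
mask = rgb_to_class_index(target, mapping)
```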

I’m dealing with multi-class, where each pixel can be assigned to a single class only. I had 4 different masks for each image, where each mask represents one class, so I had 4 binary masks. What I have done now is convert the binary masks to have values 0/1 for mask 1, 0/2 for mask 2 and so on, and then add all of them to get a mask something like


So how can I do color mapping for each class now ?

The target looks correct.
You could most likely create it by using torch.argmax(target, dim=1).
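
A quick sketch of that conversion, assuming the stacked masks are one-hot along dim 1 (exactly one of the 4 channels is 1 for every pixel). The random class indices here only serve to build such a tensor.

```python
import torch
import torch.nn.functional as F

# Build a one-hot target of shape [batch, 4, H, W] from random class indices
class_idx = torch.randint(0, 4, (2, 8, 8))
one_hot = F.one_hot(class_idx, num_classes=4).permute(0, 3, 1, 2)

# Collapse the channel dimension back into class indices: [batch, H, W]
target = torch.argmax(one_hot, dim=1)
```

Note that a pixel that is 0 in all channels would end up as class 0 here, so a background class would need its own channel for this to be unambiguous.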

What kind of color mapping do you need?

I’m sorry, I didn’t get

Is this for creating the mapping ?

I need simple color mapping, 4 colors representing each class.

Do you need this to restore the original target image?
If so, you could create a mapping with e.g. a dict and index it with your target tensor.
Note that your current target tensor is suitable to be passed into nn.CrossEntropyLoss.

I’m using nn.CrossEntropyLoss only. I need the color mapping to visualize whether the segmentation is done correctly.

In this case indexing should work:

cmap = torch.tensor([[255, 0, 0],
                     [0, 255, 0],
                     [0, 0, 255]])

target = torch.randint(0, 3, (1, 10, 10))
res = cmap[target]

This seems like the index mapping can work.

Also, correct me if I’m wrong. I’m breaking down my procedure -

  1. I created a custom dataset that fetches the images and the masks (add the masks as said above)
  2. The dataloader is iterating over these.
  3. I’m using a pretrained DeepLabV3 model with the head changed to 4 outputs (total classes for me)
  4. Using nn.CrossEntropyLoss
  5. The training doesn’t seem to be promising at all.

Is there anything I should be doing extra for semantic segmentation?

The procedure looks correct.
I would recommend to try to overfit a small data sample (e.g. just 10 data samples) to verify the training procedure does not contain any hidden errors.
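
Such a sanity check could be sketched as below. The tiny model and the synthetic data are placeholders for illustration (the target is derived from the input so that it is learnable); in practice you would use your own model and 10 samples from your dataset.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# 10 synthetic samples; the target is derived from the input so it can be learned
images = torch.randn(10, 3, 16, 16)
targets = (images[:, 0] > 0).long()  # [10, 16, 16], class indices 0 or 1

# Placeholder model, stands in for the real segmentation network
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 2, kernel_size=1),
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

losses = []
for step in range(200):
    optimizer.zero_grad()
    loss = criterion(model(images), targets)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

If the loss does not drop clearly on such a small sample, the data loading, target creation, or loss setup likely contains a bug.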

I just ran 5 epochs. I have about 250 images.
I’m a bit confused now. The shape of my outputs from the model is torch.Size([5, 4, 224, 224]). Batch size is 5 and there are 4 masks.

Now to get the prediction how do I combine these 4 separate outputs to a single mask ?

pred = torch.argmax(output, 1) would give you the predicted class indices, which you could then pass to your mapping to get the corresponding colors.
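
Putting both steps together for the shapes mentioned above, as a sketch: the random `output` stands in for the model output, and the four colors in `cmap` are arbitrary choices for illustration.

```python
import torch

# One RGB color per class (arbitrary colors, chosen for illustration)
cmap = torch.tensor([[0, 0, 0],
                     [255, 0, 0],
                     [0, 255, 0],
                     [0, 0, 255]], dtype=torch.uint8)

# Stand-in for the model output: [batch, nb_classes, H, W]
output = torch.randn(5, 4, 224, 224)

pred = torch.argmax(output, dim=1)      # [5, 224, 224], class indices in [0, 3]
color = cmap[pred]                      # [5, 224, 224, 3], channels-last RGB
color_chw = color.permute(0, 3, 1, 2)   # [5, 3, 224, 224], channels-first
```

The channels-last `color[i]` can be converted with `.numpy()` and shown directly as an image; the channels-first version is only needed for tools that expect [C, H, W].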

Thank you. This makes sense to use index with the class for mapping!