Only batches of spatial targets supported (non-empty 3D tensors) but got targets of size: : [1, 1, 256, 256]

Hi all! I’m trying to find objects in grayscale medical images, and I only have two classes: background and the lesion.
I’m scaling my images to 256*256, and I’ve mapped the color values of the mask PNGs to class indices, as suggested by @ptrblck in multiple topics. However, I’m still getting the error in the title for this line:
loss = F.cross_entropy(prediction, y)
Thank you in advance for your help :slight_smile:


If your targets contain the class indices already, you should remove the channel dimension:

target = target.squeeze(1)
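
A minimal sketch of the shapes involved, assuming a two-class model and random tensors standing in for the real data (the names `prediction` and `target` and the sizes are only illustrative):

```python
import torch
import torch.nn.functional as F

batch_size, nb_classes, height, width = 1, 2, 256, 256

# Model output: one channel of logits per class
prediction = torch.randn(batch_size, nb_classes, height, width)

# Target with an extra channel dimension, as in the error message
target = torch.randint(0, nb_classes, (batch_size, 1, height, width))

# Drop the channel dimension: [1, 1, 256, 256] -> [1, 256, 256]
target = target.squeeze(1)

loss = F.cross_entropy(prediction, target)
print(target.shape, loss.item())
```

Note that `squeeze(1)` only removes the dimension if its size is 1, so a target with 3 color channels would be left unchanged.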

I had actually tried that with dimension 0 and was wondering why it didn’t work!
And you’re awesome! Thanks for helping and posting :slight_smile:


I am running into a similar issue, but using squeeze(1) on the target didn’t solve it.

I get the error below:

    model_ft, hist = train_model(model_ft, dataloaders_dict, criterion, optimizer_ft, num_epochs=num_epochs, is_inception=(model_name=="inception"))
  File "", line 270, in train_model
    loss = criterion(outputs, labels)
  File "/home/info/.local/lib/python3.5/site-packages/torch/nn/modules/", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/info/.local/lib/python3.5/site-packages/torch/nn/modules/", line 916, in forward
    ignore_index=self.ignore_index, reduction=self.reduction)
  File "/home/info/.local/lib/python3.5/site-packages/torch/nn/", line 1995, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/home/info/.local/lib/python3.5/site-packages/torch/nn/", line 1826, in nll_loss
    ret = torch._C._nn.nll_loss2d(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: 1only batches of spatial targets supported (non-empty 3D tensors) but got targets of size: : [2, 3, 750, 1000]

And my code is like this:

def train_model(model, dataloaders, criterion, optimizer, num_epochs=25, is_inception=False):
    since = time.time()
    val_acc_history = []
    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0

    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train', 'val']:
            if phase == 'train':
                model.train()  # Set model to training mode
            else:
                model.eval()   # Set model to evaluate mode

            running_loss = 0
            total_train = 0
            correct_train = 0

            # Iterate over data.
            for inputs, labels in dataloaders[phase]:
                inputs = inputs.to(device)  # OriginalImage
                labels = labels.to(device)  # Masks

                # zero the parameter gradients
                optimizer.zero_grad()

                # forward
                # track history if only in train
                with torch.set_grad_enabled(phase == 'train'):
                    # Get model outputs and calculate loss
                    # Special case for inception because in training it has an auxiliary output. In train
                    #   mode we calculate the loss by summing the final output and the auxiliary output
                    #   but in testing we only consider the final output.
                    if is_inception and phase == 'train':
                        # From
                        outputs, aux_outputs = model(inputs)
                        loss1 = criterion(outputs, labels)
                        loss2 = criterion(aux_outputs, labels)
                        loss = loss1 + 0.4*loss2
                    else:
                        outputs = model(inputs)['out']
                        labels = labels.long()
                        labels = labels.squeeze(1)
                        loss = criterion(outputs, labels)

                    _, preds = torch.max(outputs, 1)

                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()


One weird thing that I noticed in the error output is the part that says size: : [2, 3, 750, 1000].
The 2 makes sense because the batch size is 2, but the 3 doesn’t, because I only have 2 classes. So I’m not sure why I am seeing size: : [2, 3, 750, 1000] when I am expecting size: : [2, 2, 750, 1000].

Also, for context, I am trying to fine-tune the fcn resnet101 segmentation model on my own dataset, which has only two classes, using mask images with two different colors.

The expected shape of the target tensor for a multi-class segmentation use case is [batch_size, height, width] containing the class indices, while your target seems to have the unwanted channel dimension.

How did you create the target?

I am not completely sure if this is what you mean by creating the target, but I created masked images with two classes (background and car) like in the examples below.



Note: The images above are the same size; they just show up at different sizes because one of them is a screenshot.

Then I have this class that takes in the path of the original images and masked images:

class MyDataset(Dataset):
    def __init__(self, image_paths, target_paths, train=True):
        self.image_paths = image_paths
        self.target_paths = target_paths
        self.image_dirs = os.listdir(self.image_paths)
        self.target_dirs = os.listdir(self.target_paths)

    def transform(self, image, mask):
        # Resize
        resize = transforms.Resize(size=(768, 1024))
        image = resize(image)
        mask = resize(mask)

        # Random crop
        i, j, h, w = transforms.RandomCrop.get_params(
            image, output_size=(750, 1000))
        image = TF.crop(image, i, j, h, w)
        mask = TF.crop(mask, i, j, h, w)

        # Random horizontal flipping
        if random.random() > 0.5:
            image = TF.hflip(image)
            mask = TF.hflip(mask)

        # Random vertical flipping
        if random.random() > 0.5:
            image = TF.vflip(image)
            mask = TF.vflip(mask)

        # Transform to tensor
        image = TF.to_tensor(image)
        mask = TF.to_tensor(mask)

        # Normalize the image only, not the mask
        image = TF.normalize(image, [0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
        return image, mask

    def __getitem__(self, index):
        # Load the image and its mask from the two directories
        image = Image.open(self.image_paths + self.image_dirs[index])
        mask = Image.open(self.target_paths + self.target_dirs[index])
        x, y = self.transform(image, mask)
        return x, y

    def __len__(self):
        return len(self.image_dirs)

Is that what you mean by creating a target, or am I missing something?

One thing that I am confused about is the part where you mentioned creating class indices, because besides using the masked images I haven’t created any class indices. I know I asked a similar question in this post and I haven’t responded to your last reply because I was still trying to make sense of it, and was hoping that by running the code I would figure it out but I’m still confused.

One of the reasons I am confused is that since I am using a pre-trained model, wouldn’t there be an existing mapping of the colors already that I could refer to instead of creating my own indices?

I was looking at this tutorial, and the part below seems to have a mapping. I don’t know if this is the official color mapping used for the pre-trained resnet_101 segmentation model. If it is, when fine-tuning a model, wouldn’t it be enough to have mask images that follow this color code (in my case (0, 0, 0) representing background and (128, 128, 128) representing car) for the model to deduce which class each pixel belongs to, or would I have to create new class indices anyway?

And if I do have to create new class indices, like you mention in this post, in which part of the code should this happen? Should I have a helper function that creates the class indices (mapping each color to a class), should I include it within the MyDataset class, and at what point should I call it?

# Define the helper function
def decode_segmap(image, nc=21):
    label_colors = np.array([(0, 0, 0),  # 0=background
                 # 1=aeroplane, 2=bicycle, 3=bird, 4=boat, 5=bottle
                 (128, 0, 0), (0, 128, 0), (128, 128, 0), (0, 0, 128), (128, 0, 128),
                 # 6=bus, 7=car, 8=cat, 9=chair, 10=cow
                 (0, 128, 128), (128, 128, 128), (64, 0, 0), (192, 0, 0), (64, 128, 0),
                 # 11=dining table, 12=dog, 13=horse, 14=motorbike, 15=person
                 (192, 128, 0), (64, 0, 128), (192, 0, 128), (64, 128, 128), (192, 128, 128),
                 # 16=potted plant, 17=sheep, 18=sofa, 19=train, 20=tv/monitor
                 (0, 64, 0), (128, 64, 0), (0, 192, 0), (128, 192, 0), (0, 64, 128)])
    r = np.zeros_like(image).astype(np.uint8)
    g = np.zeros_like(image).astype(np.uint8)
    b = np.zeros_like(image).astype(np.uint8)
    for l in range(0, nc):
        idx = image == l
        r[idx] = label_colors[l, 0]
        g[idx] = label_colors[l, 1]
        b[idx] = label_colors[l, 2]
    rgb = np.stack([r, g, b], axis=2)
    return rgb

Thanks for the code.
It seems you are indeed dealing with “color” images as your masks, which won’t work out of the box.

I’m not sure how the mask was created initially, but it seems the background is black, while the car class is a mixture of mostly gray and some artifacts on the borders, which is not good.
If you have mask images with a specific color for each class, e.g. black = background, white = car, red = …, then you could use the mentioned mapping to create class indices.
The result would be a new mask tensor only containing a class index for each pixel (also without the channel dimension). In this case the background could have values of 0, while the car would have the class index 1.
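
As a rough sketch of such a mapping, assuming the mask is already loaded as a uint8 tensor of shape [3, H, W] and that black and gray are the only colors present (the function name `colors_to_indices` and the `mapping` dictionary are made up for illustration):

```python
import torch

def colors_to_indices(mask_rgb, mapping):
    """Convert a uint8 color mask of shape [3, H, W] into a
    LongTensor of shape [H, W] holding one class index per pixel."""
    _, height, width = mask_rgb.shape
    # -1 marks pixels whose color is not in the mapping
    index_mask = torch.full((height, width), -1, dtype=torch.long)
    for color, class_index in mapping.items():
        color_t = torch.tensor(color, dtype=torch.uint8).view(3, 1, 1)
        # True where all three channels match this color
        matches = (mask_rgb == color_t).all(dim=0)
        index_mask[matches] = class_index
    return index_mask

# Hypothetical mapping: black = background, gray = car
mapping = {(0, 0, 0): 0, (128, 128, 128): 1}

mask_rgb = torch.zeros(3, 4, 6, dtype=torch.uint8)
mask_rgb[:, 1:3, 2:5] = 128  # a small gray "car"

index_mask = colors_to_indices(mask_rgb, mapping)
print(index_mask)
```

Any pixel left at -1 has a color that is missing from the mapping, which is a quick way to spot border artifacts in the mask.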

Once you get the mapping, note that you have to be careful about resizing operations, as they might use interpolation techniques by default, thus creating “between class” colors or indices.
E.g. the border of your car could slowly turn from the car class to the background class, which is not useful for a classification task, as you won’t be able to map these colors back to class indices.

The Resize transformation has an interpolation argument, which should be set to PIL.Image.NEAREST for the mask.
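
To illustrate the effect, here is a small sketch that uses torch.nn.functional.interpolate as a stand-in for the Resize transform (an assumption made to keep the example self-contained; the toy mask is invented):

```python
import torch
import torch.nn.functional as F

# Toy mask: left half is class 0, right half is class 1
mask = torch.zeros(1, 1, 4, 4)
mask[..., 2:] = 1.0

nearest = F.interpolate(mask, size=(8, 8), mode="nearest")
bilinear = F.interpolate(mask, size=(8, 8), mode="bilinear", align_corners=False)

print(nearest.unique())   # only the original class values
print(bilinear.unique())  # also contains blended values at the border
```

The blended values from the bilinear version no longer correspond to any class, which is why nearest-neighbor interpolation is the safer choice for masks.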

The model doesn’t know anything about the class mapping. It has an output conv layer with a number of output channels, which correspond to the classes.
As far as I know, the pretrained models were trained on the COCO dataset, so the output channels would correspond to whatever the COCO dataset defines as class0, class1, etc.

Since you are dealing with two classes and would most likely replace the last conv layer with a new one returning two output channels, you are free to choose whatever mapping you want (for 2 classes you don’t have much choice :wink: ).

I would say it depends how you are loading the image and mask.
If you are lazily loading both in the __getitem__ method of your Dataset, I would convert the mask images there to the right format.


Thanks so much for your answer and the note on resizing. I’m still trying to understand all of what you said, but I have a question right off the bat. You said that I’m ‘dealing with “color” images as your masks, which won’t work out of the box’. If color images don’t work out of the box as masks, are there other types of images that do?

I’ve seen datasets providing the masks as arrays already with the class index for each pixel, which would work right away.

So if I am understanding this correctly, the targets or masks need to be in a matrix-type format. Let’s say I have an image that’s 7 by 5 pixels, with two classes: class 0 corresponds to background and the color black in the mask, and class 1 corresponds to car and the color blue in the mask. Would a mask for a 7 by 5 image then look like this?


I know the above doesn’t look anything like a car…more like a square with a border, but I didn’t want to create a huge matrix.


Yes, your understanding is correct.
This mask would contain your provided values and should have the shape [batch_size, height, width].
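
A small sketch of such a mask, assuming 7 rows by 5 columns with a background border around the "car" (the layout is only a guess at the example described above):

```python
import torch

height, width = 7, 5

# Class 0 = background everywhere, class 1 = "car" in the interior
mask = torch.zeros(height, width, dtype=torch.long)
mask[1:-1, 1:-1] = 1

# A batch of one such mask has shape [batch_size, height, width]
target = mask.unsqueeze(0)
print(mask)
print(target.shape)
```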

Thanks so much for clarifying! Something else that I don’t understand: if the class index for car is 7 in the pre-trained model, and I assign it the value 1 when I fine-tune the model, how does the model make the connection that index 7 in the pre-trained model and index 1 in the fine-tuned version both refer to the class ‘car’? Basically, I want to understand how the pre-trained model’s knowledge of what a car looks like gets transferred when I fine-tune, if there is no connection between the original class ID and the one I make up.

By “finetune the model” I assume you are reinitializing the last layer?
Also, let’s assume you are freezing all preceding layers and just train the last new classification layer (nn.Conv2d with out_channels=2 in your case).

The “feature extractor” part of the model, so basically all layers before the final classifier, is already trained on the COCO dataset and can yield useful features for classifying your dataset.
If you swap the last classification layer for a new one adapted to your use case, these incoming features will be used to train the classification layer so that it minimizes the loss for your segmentation use case.

The last layer won’t have any knowledge about the trained classes in the past, but the other layers might yield good features for your current classes.
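
A minimal sketch of this setup, using a tiny nn.Sequential as a stand-in for the pretrained segmentation model (the layer sizes and learning rate are arbitrary assumptions, not the real architecture):

```python
import torch
import torch.nn as nn

# Toy stand-in for a pretrained segmentation model
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),  # "feature extractor"
    nn.ReLU(),
    nn.Conv2d(8, 21, kernel_size=1),            # original classifier, 21 classes
)

# Freeze everything that is already trained
for param in model.parameters():
    param.requires_grad = False

# Swap in a freshly initialized classifier with two output channels
model[2] = nn.Conv2d(8, 2, kernel_size=1)

# Only the new layer's parameters are passed to the optimizer
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.01)

output = model(torch.randn(1, 3, 16, 16))
print(output.shape)
```

The new layer starts from random weights, so the class order it learns comes entirely from the indices in your masks.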

Side note, unrelated to your original question:
Note that this approach might still work, if you change the data domain, e.g. if you would like to segment organs in a CT scan now.
In such a case, I would recommend to try to finetune the complete model, if you have enough data, as the features from the first layers might not be suitable for the CT images.

Thanks for the explanation. So when the feature_extract parameter from this tutorial is set to True, only the last layer is updated, but the other layers still contain information that might be useful.

So would you say that if you have a small dataset, setting the feature_extract parameter to True would generally yield better results? Specifically if your classes come from the original domain the pre-trained model was trained on, like in my case where I have car and background?

This would be my best guess, yes.
If your dataset is small and from the same domain, I would first try to only retrain the classifier and freeze all other layers.
However, a quick test retraining the whole model might yield other results, so it would be interesting what works better for your use case. :wink:

Sounds good! I was planning on trying it both ways to see what yields better results, but was curious about what was recommended.

Thanks for all your help :slight_smile:

Hi @ptrblck ,

I actually have a few questions about the color mapping example you have from this post.

In the code below you are just creating a dummy mask image? So since I already have the masked RGB image I wouldn’t need to have the code below, correct?

# Create dummy target image
nb_classes = 19 - 1 # 18 classes + background
idx = np.linspace(0., 1., nb_classes)
cmap = plt.get_cmap('viridis')
rgb = cmap(idx, bytes=True)[:, :3]  # Remove alpha value

h, w = 190, 100
rgb = rgb.repeat(1000, 0)
target = np.zeros((h*w, 3), dtype=np.uint8)
target[:rgb.shape[0]] = rgb
target = target.reshape(h, w, 3)

This is what I have so far for the function that I am going to include in the MyDataset class:

def convertTargetToMatrix(self, target):
    h = 750
    w = 1000
    mapping = {}  # Dictionary where the key is the class id and the value is the color in the mask
    mapping[0] = (0, 0, 0)  # Class 0 = background
    mapping[1] = (128, 128, 128)  # Class 1 = car
    mask = torch.empty(h, w, dtype=torch.long)  # Empty mask to be filled in below

    # TODO: Change each RGB value in the color mask to its corresponding class index
    # Pseudo code below:
    for y in range(h):
        for x in range(w):
            rgbValue = getRGBOfTargetAtXY(x, y)
            if rgbValue == mapping[0]:  # pixel is background
                mask[y][x] = 0
            elif rgbValue == mapping[1]:  # pixel is car
                mask[y][x] = 1

    return mask

Can you let me know if I am going in the right direction or if I am doing something wrong here? The pseudo code I wrote is how I conceptually understand what’s happening (i.e. replacing RGB values with class ids), but if there is a specific function in torch or torchvision that does this better and more efficiently, please let me know. I don’t know if the code below is doing that, as I don’t really understand it.

for k in mapping:
    # Get all indices for current class
    idx = (target==torch.tensor(k, dtype=torch.uint8).unsqueeze(1).unsqueeze(2))
    validx = (idx.sum(0) == 3)  # Check that all channels match
    mask[validx] = torch.tensor(mapping[k], dtype=torch.long)

Specifically, I don’t understand what the line below is doing:

idx = (target==torch.tensor(k, dtype=torch.uint8).unsqueeze(1).unsqueeze(2))

I have 4 masks per image, and each mask is a binary image containing 0s and 1s. I have concatenated them into a tensor of shape [batch, 4, 224, 224], but I’m getting the error mentioned in the topic. Is there a different way that I need to stack these masks?


Are you dealing with a multi-class or multi-label segmentation use case?
In the former case, each pixel would belong to a single class only, while each pixel might belong to zero, one, or more classes in the latter case.

For a multi-class segmentation, you would most likely use nn.CrossEntropyLoss and your target is expected to be a LongTensor containing the class indices in the range [0, nb_classes-1] in the shape [batch_size, height, width].

Currently your target seems to be a one-hot encoded (or multi-hot encoded) target.
Let me know, what kind of use case you are working on and if you need more information. :slight_smile:
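
A sketch of both options, assuming 4 stacked binary masks and random tensors in place of real model outputs (all names and sizes here are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

batch_size, nb_masks, height, width = 2, 4, 8, 8
logits = torch.randn(batch_size, nb_masks, height, width)

# Multi-class: assumes exactly one of the 4 channels is 1 at every pixel
class_indices = torch.randint(0, nb_masks, (batch_size, height, width))
one_hot = F.one_hot(class_indices, nb_masks).permute(0, 3, 1, 2)  # [batch, 4, H, W]
multiclass_target = one_hot.argmax(dim=1)                         # [batch, H, W]
multiclass_loss = nn.CrossEntropyLoss()(logits, multiclass_target)

# Multi-label: channels are independent, target keeps shape [batch, 4, H, W]
multilabel_target = torch.randint(0, 2, (batch_size, nb_masks, height, width)).float()
multilabel_loss = nn.BCEWithLogitsLoss()(logits, multilabel_target)

print(multiclass_target.shape, multilabel_target.shape)
```

Keep in mind that argmax returns 0 for a pixel where all channels are 0, so the multi-class conversion only makes sense if every pixel belongs to exactly one of the stacked masks.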

Sorry for the late reply.

Yes, in my code snippet I’m using a randomly initialized mask, so you should use your mask.
I’m not aware of a fast built-in method to create this mapping, but I’m sure that my code is not the most efficient one. :wink:

The line of code gets the indices for the current class index by comparing the target image to the current class mapping (color code).
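
To make the shapes visible, here is a step-by-step sketch of that comparison on a tiny invented target of shape [3, H, W], assuming `k` is the color tuple being looked up:

```python
import torch

# Tiny color target of shape [3, H, W] with one gray pixel at row 0, column 1
target = torch.zeros(3, 2, 2, dtype=torch.uint8)
target[:, 0, 1] = 128

k = (128, 128, 128)  # the color being looked up

color = torch.tensor(k, dtype=torch.uint8)  # shape [3]
color = color.unsqueeze(1).unsqueeze(2)     # shape [3, 1, 1]

idx = (target == color)     # broadcast to [3, H, W], per-channel matches
validx = (idx.sum(0) == 3)  # [H, W], True where all three channels match
print(validx)
```

The two unsqueeze calls exist only so that the color broadcasts across the height and width dimensions of the target.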