How to set mask labels for Mask R-CNN so that I can fine-tune it into a 3-class classification and segmentation model?

Hello, I recently followed the tutorial on the PyTorch official website (Link) to fine-tune a Mask R-CNN.

In the tutorial, the author only needs to separate the pedestrians from the background, so the definition of the Dataset class looks like this:
import os
import numpy as np
import torch
from PIL import Image

class PennFudanDataset(object):
    def __init__(self, root, transforms):
        self.root = root
        self.transforms = transforms
        # load all image files, sorting them to
        # ensure that they are aligned
        self.imgs = list(sorted(os.listdir(os.path.join(root, "PNGImages"))))
        self.masks = list(sorted(os.listdir(os.path.join(root, "PedMasks"))))

    def __getitem__(self, idx):
        # load images and masks
        img_path = os.path.join(self.root, "PNGImages", self.imgs[idx])
        mask_path = os.path.join(self.root, "PedMasks", self.masks[idx])
        img = Image.open(img_path).convert("RGB")
        # note that we haven't converted the mask to RGB,
        # because each color corresponds to a different instance
        # with 0 being background
        mask = Image.open(mask_path)
        # convert the PIL Image into a numpy array
        mask = np.array(mask)
        # instances are encoded as different colors
        obj_ids = np.unique(mask)
        # first id is the background, so remove it
        obj_ids = obj_ids[1:]
        # split the color-encoded mask into a set
        # of binary masks
        masks = mask == obj_ids[:, None, None]
        # get bounding box coordinates for each mask
        num_objs = len(obj_ids)
        boxes = []
        for i in range(num_objs):
            pos = np.where(masks[i])
            xmin = np.min(pos[1])
            xmax = np.max(pos[1])
            ymin = np.min(pos[0])
            ymax = np.max(pos[0])
            boxes.append([xmin, ymin, xmax, ymax])
        # convert everything into a torch.Tensor
        boxes = torch.as_tensor(boxes, dtype=torch.float32)
        # there is only one class
        labels = torch.ones((num_objs,), dtype=torch.int64)
        masks = torch.as_tensor(masks, dtype=torch.uint8)
        image_id = torch.tensor([idx])
        area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
        # suppose all instances are not crowd
        iscrowd = torch.zeros((num_objs,), dtype=torch.int64)

        target = {}
        target["boxes"] = boxes
        target["labels"] = labels
        target["masks"] = masks
        target["image_id"] = image_id
        target["area"] = area
        target["iscrowd"] = iscrowd

        if self.transforms is not None:
            img, target = self.transforms(img, target)

        return img, target

    def __len__(self):
        return len(self.imgs)

However, in my case I need to separate the nose, the mouth, and the background in face images.

I’d like to know how I should change the definition of the Dataset class (especially the __getitem__ method) so that my model can separate the nose and the mouth as different classes after training.


How did you store the target masks?
If each class uses a separate class index, you could reuse most of the code and create the corresponding labels tensor based on the current class index.
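For example (a minimal sketch; build_labels and obj_id_to_class are hypothetical names, not part of the tutorial), the single labels = torch.ones(...) line could be replaced by a lookup from each instance id to its class index:

import torch

def build_labels(obj_ids, obj_id_to_class):
    # obj_ids: the instance ids found in the color-encoded mask,
    # as computed in the tutorial's __getitem__
    # obj_id_to_class: hypothetical mapping from instance id to class index
    return torch.tensor([obj_id_to_class[int(i)] for i in obj_ids],
                        dtype=torch.int64)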

For my mask data, each region of the face has its own binary mask image (e.g. if the image name is 0.jpg, the related masks are 0_nose.png and 0_mouth.png).
In the Dataset class, I followed the methodology of the tutorial: I convert each binary mask image into a [1, 512, 512] tensor, then concatenate the mouth and nose mask tensors along dim 0, i.e. my target masks form a [2, 512, 512] tensor. Then I set the labels tensor to torch.tensor([1, 2]).
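In rough code, my __getitem__ does something like this (just a sketch; load_face_target, the directory names, and the 512x512 size are specific to my data, not from the tutorial):

import os
import numpy as np
import torch
from PIL import Image

def load_face_target(root, idx):
    # binary masks of shape [512, 512], one file per face region
    mouth = np.array(Image.open(os.path.join(root, "masks", f"{idx}_mouth.png"))) > 0
    nose = np.array(Image.open(os.path.join(root, "masks", f"{idx}_nose.png"))) > 0
    # stack along dim 0 -> [2, 512, 512]
    masks = torch.as_tensor(np.stack([mouth, nose], axis=0), dtype=torch.uint8)
    # one label per stacked mask: 1 = mouth, 2 = nose
    labels = torch.tensor([1, 2], dtype=torch.int64)
    return masks, labels

The bounding boxes are then computed from each binary mask as in the tutorial.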

Is it right for me to stack the masks of different classes along dim 0?

If you are using nn.CrossEntropyLoss (or nn.NLLLoss), the target mask shape should be [batch_size, height, width]. I would therefore recommend transforming the binary masks into a single target mask containing the class indices (e.g. 0 for background, 1 for mouth, and 2 for nose).
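Something along these lines (a minimal sketch, assuming mouth_mask and nose_mask are binary tensors of shape [height, width]):

import torch

def to_class_index_mask(mouth_mask, nose_mask):
    # start with background (class 0) everywhere
    target = torch.zeros_like(mouth_mask, dtype=torch.long)
    # write the class indices: 1 = mouth, 2 = nose
    target[mouth_mask == 1] = 1
    target[nose_mask == 1] = 2
    return target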

Does this mean the channel dimension of the mask tensor is just 1? I thought the mask shape was [batch_size, channel, height, width].

The target for the mentioned criterion does not have a channel dimension, as it’s not one-hot encoded.
Instead, it contains the class indices directly, so for a segmentation use case your target should have the shape [batch_size, height, width].
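For a quick shape check (a sketch with random tensors; 3 classes and a 512x512 resolution are just example values):

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
output = torch.randn(4, 3, 512, 512)          # [batch_size, num_classes, height, width]
target = torch.randint(0, 3, (4, 512, 512))   # [batch_size, height, width] with class indices
loss = criterion(output, target)
print(loss)  # scalar loss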

Hi there,
I have the same problem as you. Can you tell me which lines of code you changed?