How to set mask labels for Mask R-CNN so that I can fine-tune it into a 3-class classification and segmentation model?

Hello, I recently followed the tutorial on the PyTorch official website (Link) to fine-tune a Mask R-CNN.

In the tutorial, the author only needs to separate the pedestrians from the background, so the definition of the Dataset class looks like this:
import os
import numpy as np
import torch
from PIL import Image

class PennFudanDataset(object):
    def __init__(self, root, transforms):
        self.root = root
        self.transforms = transforms
        # load all image files, sorting them to
        # ensure that they are aligned
        self.imgs = list(sorted(os.listdir(os.path.join(root, "PNGImages"))))
        self.masks = list(sorted(os.listdir(os.path.join(root, "PedMasks"))))

    def __getitem__(self, idx):
        # load images and masks
        img_path = os.path.join(self.root, "PNGImages", self.imgs[idx])
        mask_path = os.path.join(self.root, "PedMasks", self.masks[idx])
        img = Image.open(img_path).convert("RGB")
        # note that we haven't converted the mask to RGB,
        # because each color corresponds to a different instance
        # with 0 being background
        mask = Image.open(mask_path)
        # convert the PIL Image into a numpy array
        mask = np.array(mask)
        # instances are encoded as different colors
        obj_ids = np.unique(mask)
        # first id is the background, so remove it
        obj_ids = obj_ids[1:]
        # split the color-encoded mask into a set
        # of binary masks
        masks = mask == obj_ids[:, None, None]
        # get bounding box coordinates for each mask
        num_objs = len(obj_ids)
        boxes = []
        for i in range(num_objs):
            pos = np.where(masks[i])
            xmin = np.min(pos[1])
            xmax = np.max(pos[1])
            ymin = np.min(pos[0])
            ymax = np.max(pos[0])
            boxes.append([xmin, ymin, xmax, ymax])
        # convert everything into a torch.Tensor
        boxes = torch.as_tensor(boxes, dtype=torch.float32)
        # there is only one class
        labels = torch.ones((num_objs,), dtype=torch.int64)
        masks = torch.as_tensor(masks, dtype=torch.uint8)
        image_id = torch.tensor([idx])
        area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
        # suppose all instances are not crowd
        iscrowd = torch.zeros((num_objs,), dtype=torch.int64)

        target = {}
        target["boxes"] = boxes
        target["labels"] = labels
        target["masks"] = masks
        target["image_id"] = image_id
        target["area"] = area
        target["iscrowd"] = iscrowd

        if self.transforms is not None:
            img, target = self.transforms(img, target)

        return img, target

    def __len__(self):
        return len(self.imgs)

However, in my case I need to separate the nose, the mouth, and the background in face images.

I’d like to know how I should change the definition of the Dataset class (especially the __getitem__ method) so that my model can separate the nose and the mouth as different classes after training.


How did you store the target masks?
If each class uses a separate class index, you could reuse most of the code and create the corresponding labels tensor based on the current class index.
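For example (a minimal sketch; build_labels and obj_id_to_class are hypothetical names, not part of the tutorial), the single labels = torch.ones(...) line could be replaced by a lookup from each instance id to its class index:

import torch

def build_labels(obj_ids, obj_id_to_class):
    # obj_ids: the instance ids found in the color-encoded mask,
    # as computed in the tutorial's __getitem__
    # obj_id_to_class: hypothetical mapping from instance id to class index
    return torch.tensor([obj_id_to_class[int(i)] for i in obj_ids],
                        dtype=torch.int64)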

For my mask data, each region of the face has its own binary mask image (e.g. if the image name is 0.jpg, the related masks are 0_nose.png and 0_mouth.png).
In the Dataset class, I followed the methodology of the tutorial: I convert each binary mask image into a [1, 512, 512] tensor, then concatenate the mouth and nose mask tensors along dim 0, i.e. my target masks form a [2, 512, 512] tensor. Then I set the labels tensor to torch.tensor([1, 2]).
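In rough code, my __getitem__ does something like this (just a sketch; load_face_target, the directory names, and the 512x512 size are specific to my data, not from the tutorial):

import os
import numpy as np
import torch
from PIL import Image

def load_face_target(root, idx):
    # binary masks of shape [512, 512], one file per face region
    mouth = np.array(Image.open(os.path.join(root, "masks", f"{idx}_mouth.png"))) > 0
    nose = np.array(Image.open(os.path.join(root, "masks", f"{idx}_nose.png"))) > 0
    # stack along dim 0 -> [2, 512, 512]
    masks = torch.as_tensor(np.stack([mouth, nose], axis=0), dtype=torch.uint8)
    # one label per stacked mask: 1 = mouth, 2 = nose
    labels = torch.tensor([1, 2], dtype=torch.int64)
    return masks, labels

The bounding boxes are then computed from each binary mask as in the tutorial.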

Is it right for me to stack the masks of different classes along dim 0?

If you are using nn.CrossEntropyLoss (or nn.NLLLoss), the target mask shape should be [batch_size, height, width]. I would therefore recommend transforming the binary masks into a single target mask containing the class indices (e.g. 0 for background, 1 for mouth, and 2 for nose).
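Something along these lines (a minimal sketch, assuming mouth_mask and nose_mask are binary tensors of shape [height, width]):

import torch

def to_class_index_mask(mouth_mask, nose_mask):
    # start with background (class 0) everywhere
    target = torch.zeros_like(mouth_mask, dtype=torch.long)
    # write the class indices: 1 = mouth, 2 = nose
    target[mouth_mask == 1] = 1
    target[nose_mask == 1] = 2
    return target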

Does this mean the channel dimension of the mask tensor is just 1? I thought the mask shape was [batch_size, channel, height, width].

The target for the mentioned criterion does not have a channel dimension, as it’s not one-hot encoded.
Instead, it contains the class indices directly, so for a segmentation use case your target should have the shape [batch_size, height, width].
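For a quick shape check (a sketch with random tensors; 3 classes and a 512x512 resolution are just example values):

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
output = torch.randn(4, 3, 512, 512)          # [batch_size, num_classes, height, width]
target = torch.randint(0, 3, (4, 512, 512))   # [batch_size, height, width] with class indices
loss = criterion(output, target)
print(loss)  # scalar loss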

Hi there,
I have the same problem as you. Can you tell me which lines of code you changed?