Multi-class Multi-label image segmentation for beginners

So I’m a self-admitted noob, but with computer vision it looks like I dove straight into the deep end, and I think I touched bottom.

I started a couple of weeks ago with lobe but quickly realized I needed way more. I moved to detectron2 and just got that working last night (using Mask R-CNN to generate image segmentation masks), but I realized again that I need a little more.

Since it looks like detectron2 leans on PyTorch, I figured I’d ask here too.

Is there a model out there that can do multi-label image segmentation? If there isn’t, is there a way to take Mask R-CNN and change its classification layers to return multiple classes per mask region?

I could build and run multiple trained models side by side (like where I started with lobe), but the detectron2 models are kinda chunky and I don’t want to load a few gigs of models to generate output.

I know I’m asking questions way over my head, be gentle :stuck_out_tongue_winking_eye:

Hi Andrew!

In short: Yes, you can perform multi-label, multi-class image segmentation,
and image-segmentation architectures can relatively easily be modified to
do so (or already do so).

Let me draw a distinction between instance segmentation (which is what
Mask R-CNN does), the easier problem of semantic segmentation (such
as performed by U-Net), and object detection (which is “simpler” in that
it doesn’t assign specific pixels to specific classes or instances).

In instance segmentation, you assign each pixel to a specific instance of
an object of a given class. Thus, you might say that this pixel belongs to
the second person in the image and that pixel belongs to the third dog in
the image. So instance segmentation can naturally be multi-class (that
is person vs. dog vs. pelican). But it is less naturally multi-label. That is,
a given pixel would not typically be assigned to more than one class (not
both a person and a dog). (I could imagine inventing such a use case, but
it would seem contrived.)

Note that, as it stands, Mask R-CNN can and does generate overlapping
“mask regions” (by which I assume that you mean bounding boxes). So
bounding box 1 could be “person” and it could overlap with bounding box
2, which could be “dog.” The masks within the bounding boxes are then
essentially independent of one another. Although you would hope it
wouldn’t, it could well turn out that the predicted mask for the person
in bounding box 1 shares some pixels with the predicted mask for the
dog in bounding box 2. So, although not its typical use case, in some
sense Mask R-CNN is already multi-label.
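
As a rough sketch of that last point: given the per-instance binary masks
and class names a model predicts, you can collect, for every pixel, the set
of classes whose masks cover it. Pixels with more than one class are
effectively multi-label. The masks below are made-up toy data stored as
plain nested lists, and `labels_per_pixel` is a hypothetical helper name,
not part of detectron2 or PyTorch.

```python
def labels_per_pixel(masks, class_names):
    """Return an H x W grid of sets: the classes whose masks cover each pixel.

    masks: list of H x W grids of 0/1, one grid per predicted instance
    class_names: list of class names, one per instance
    """
    height, width = len(masks[0]), len(masks[0][0])
    grid = [[set() for _ in range(width)] for _ in range(height)]
    for mask, name in zip(masks, class_names):
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    grid[y][x].add(name)
    return grid


# Toy 2 x 4 image: the two masks share exactly one pixel, at row 0, column 2.
person_mask = [[1, 1, 1, 0],
               [1, 1, 0, 0]]
dog_mask = [[0, 0, 1, 1],
            [0, 0, 1, 1]]

grid = labels_per_pixel([person_mask, dog_mask], ["person", "dog"])

# Pixels claimed by more than one instance mask, as (row, column) pairs.
overlap = [(y, x)
           for y, row in enumerate(grid)
           for x, labels in enumerate(row)
           if len(labels) > 1]
print(overlap)
```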

In the case of semantic segmentation (for which again a multi-label use
case would be somewhat contrived), it is perfectly straightforward to take
something like a (multi-class) U-Net, interpret its final per-pixel class
predictions as multi-label predictions, and train it against multi-label
target (ground-truth) data, typically replacing a CrossEntropyLoss
criterion with BCEWithLogitsLoss.
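
Here is a minimal sketch of that loss swap. The single `Conv2d` layer is
only a stand-in for a real U-Net (an assumption made to keep the example
short), and the images and targets are random placeholder data. The thing
to notice is the shape and dtype of the target each criterion expects.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

n_classes, height, width = 3, 8, 8

# Stand-in for a U-Net: anything that maps (N, 3, H, W) images to
# (N, n_classes, H, W) per-pixel logits would play the same role.
model = nn.Conv2d(in_channels=3, out_channels=n_classes, kernel_size=3, padding=1)

images = torch.randn(2, 3, height, width)
logits = model(images)  # shape (2, n_classes, H, W)

# Multi-class, single-label: one integer class index per pixel.
single_label_target = torch.randint(0, n_classes, (2, height, width))
ce_loss = nn.CrossEntropyLoss()(logits, single_label_target)

# Multi-label: an independent 0/1 target per class per pixel, as floats.
multi_label_target = torch.randint(0, 2, (2, n_classes, height, width)).float()
bce_loss = nn.BCEWithLogitsLoss()(logits, multi_label_target)

# At inference, threshold each class independently instead of taking an
# argmax over classes (logit > 0 corresponds to sigmoid probability > 0.5).
predicted_labels = logits > 0.0  # bool tensor, shape (2, n_classes, H, W)
```

With the multi-label target, a pixel can be "on" for several classes at
once, or for none of them.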


K. Frank

Hi, K. Frank!

Thanks for writing back. I appreciate you putting up with my excited noob questions (I very much know that pain) and for really breaking things down. I didn’t even realize the difference between instance and semantic segmentation.

Originally I had thought of running multiple models against the image, collecting the masks, and combining them (kind of like I originally did with lobe’s basic image classification). The models are kind of chunky though, so that turns into its own pain.

That’s why I asked if it was possible to modify the model itself, since having one model do it would definitely be the simplest solution.
(As an aside, though, one of my predictions from my quick and dirty concept did have some slightly overlapping regions, but the second label for the zone never got predicted at all. I’m using detectron2, and now I’m wondering if it did some kind of data cleanup in there somewhere to remove the identical training polygons. I wonder if I could use random to jiggle them a tiny amount.)

Maybe I could use a heavy model to predict a region and then run a multi-label object detection model over top of it to get additional features and see which region they mostly overlapped with.
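
A very rough sketch of what I mean by "mostly overlapped with", assuming both models give back plain (x1, y1, x2, y2) boxes. The region names, labels, and the `assign_to_regions` helper are all made up for illustration and aren't from detectron2:

```python
def intersection_area(box_a, box_b):
    """Overlap area of two (x1, y1, x2, y2) boxes; 0 if they don't overlap."""
    width = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    height = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    return max(width, 0) * max(height, 0)


def assign_to_regions(regions, detections):
    """Attach each detection's label to the region it overlaps most.

    regions: dict of region name -> box
    detections: list of (label, box) pairs
    Returns a dict of region name -> list of labels. Detections that
    overlap no region at all are dropped.
    """
    assigned = {name: [] for name in regions}
    for label, box in detections:
        best = max(regions, key=lambda name: intersection_area(regions[name], box))
        if intersection_area(regions[best], box) > 0:
            assigned[best].append(label)
    return assigned


# Made-up example data: two regions from the heavy model, three detections
# from the second model.
regions = {"region_a": (0, 0, 10, 10), "region_b": (8, 0, 20, 10)}
detections = [
    ("striped", (1, 1, 9, 9)),
    ("spotted", (9, 2, 15, 8)),
    ("stray", (30, 30, 40, 40)),
]
result = assign_to_regions(regions, detections)
print(result)
```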

Either way, thanks for the great explanation! I very much appreciate the reply.


Hey, guess what, K. Frank: detectron2 IS doing some kind of data pruning. I jiggled my polygon a bit and it successfully predicted two regions on top of each other from one model.
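
A minimal sketch of that kind of jiggle, assuming the polygons are stored as flat [x1, y1, x2, y2, ...] lists of pixel coordinates (the COCO-style layout). `jitter_polygon` is a hypothetical helper name and the 0.5-pixel shift is an arbitrary choice, and none of this says anything about what detectron2 actually does internally:

```python
import random


def jitter_polygon(polygon, max_shift=0.5, rng=random):
    """Return a copy of a flat [x1, y1, x2, y2, ...] polygon with every
    coordinate nudged by a random amount in [-max_shift, max_shift]."""
    return [coord + rng.uniform(-max_shift, max_shift) for coord in polygon]


# Seeded generator so the jitter is reproducible between runs.
rng = random.Random(0)

original = [10.0, 10.0, 50.0, 10.0, 50.0, 40.0, 10.0, 40.0]
jittered = jitter_polygon(original, max_shift=0.5, rng=rng)
print(jittered)
```

The second label for the same zone would use the jittered copy, so the two annotations are no longer exactly identical.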

It’s not exactly multi-label, but it’s close enough that I can combine the regions and get the results I’m after, I think!