This is perfectly reasonable for Mask R-CNN (although it may indicate
that you are working on a problem that is inherently more difficult to train).
Mask R-CNN performs instance segmentation. It is conceptually fine
to have instance-1 of class-A be contained in instance-2 also of class-A.
You could also have instance-1 of class-A be enclosed in instance-1 of
class-B.
Mask R-CNN can be applied to such problems (assuming enough training
data of good-enough quality, and so on), although if you told me that such
a use case would tend to be harder to train, I would believe you.
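To make the nested case concrete, here is a minimal sketch of what such a target could look like. It assumes each instance gets its own binary mask stacked into an [N, H, W] array; the image size, the class id 1 for class-A, and the variable names are made up for illustration, and NumPy is used only to keep the example small.

```python
import numpy as np

H, W = 8, 8  # hypothetical image size

# Instance-2 of class-A: a large outer region.
outer = np.zeros((H, W), dtype=np.uint8)
outer[1:7, 1:7] = 1

# Instance-1 of class-A: a small region fully inside instance-2.
inner = np.zeros((H, W), dtype=np.uint8)
inner[3:5, 3:5] = 1

# One binary mask per instance, so overlapping pixels are not a conflict.
masks = np.stack([inner, outer])  # shape [N, H, W] with N = 2
labels = np.array([1, 1])         # both instances are class-A (assumed id 1)

# True when every pixel of the inner mask is also set in the outer mask.
contained = bool(np.all(outer[inner == 1] == 1))
```

Whether the outer mask should include the inner instance's pixels or leave a hole there is an annotation choice for your dataset; this sketch includes them.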
I found the error in my code. I was feeding the model a mask tensor of shape [N, 1, X, Y], which is the shape of the model's output, instead of the [N, X, Y] shape required for the input masks.
These models often seem to be peculiar about the shapes of their inputs. I had a similar error a long time ago when I fed the labels as a [1, N] tensor rather than the [N] tensor that was required.
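For anyone hitting the same thing, a small shape check before feeding the model would have caught both mistakes. This is only a sketch: `to_target_shapes` is a hypothetical helper, not part of any library, and it uses NumPy arrays for brevity. If your targets are PyTorch tensors, `squeeze` with a dimension argument should do the same job.

```python
import numpy as np

def to_target_shapes(masks, labels):
    """Drop singleton axes so masks are [N, X, Y] and labels are [N].

    Hypothetical helper: masks may arrive as [N, 1, X, Y] and labels
    as [1, N]; anything else that does not fit raises ValueError.
    """
    masks = np.asarray(masks)
    labels = np.asarray(labels)

    # [N, 1, X, Y] -> [N, X, Y]
    if masks.ndim == 4 and masks.shape[1] == 1:
        masks = masks.squeeze(axis=1)

    # [1, N] -> [N]
    if labels.ndim == 2 and labels.shape[0] == 1:
        labels = labels.squeeze(axis=0)

    if masks.ndim != 3:
        raise ValueError(f"expected masks of shape [N, X, Y], got {masks.shape}")
    if labels.shape != (masks.shape[0],):
        raise ValueError(f"expected labels of shape [N], got {labels.shape}")
    return masks, labels


masks_out, labels_out = to_target_shapes(
    np.zeros((3, 1, 4, 5), dtype=np.uint8),  # masks in the model's output shape
    np.array([[1, 2, 3]]),                   # labels as [1, N]
)
```

Squeezing a named axis rather than calling `squeeze()` with no argument matters when N is 1, since a bare squeeze would also drop the instance axis.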