Transformer decoder output collapses to a "trivial solution" for instance segmentation

Hi everyone,

I am working on an instance segmentation task in the DETR-style set-prediction setup. I feed features from a pretrained backbone into a transformer, with one query token per instance slot (the number of tokens matches the number of instances in my dataset). The model is supposed to predict one mask per instance, but it collapses to a trivial solution: it predicts the same mask for every sample, and that mask covers all instances at once, instead of the expected one mask per instance. My setup (a minimal sketch of the model follows the list below):

  • Each frame contains multiple instances.
  • Ground-truth masks in each image are separated per instance (one unique mask for each instance).
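
To make the setup concrete, here is a minimal sketch of the model (heavily simplified; the class name, dimensions, and number of queries are placeholders rather than my exact values):

```python
import torch
import torch.nn as nn

class InstanceMaskDecoder(nn.Module):
    """Simplified stand-in for my model; names and sizes are placeholders."""
    def __init__(self, d_model=256, num_queries=16, mask_size=64):
        super().__init__()
        # one learned query per instance slot
        self.queries = nn.Embedding(num_queries, d_model)
        layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        # project each decoded query to flattened per-instance mask logits
        self.mask_head = nn.Linear(d_model, mask_size * mask_size)
        self.mask_size = mask_size

    def forward(self, backbone_feats):
        # backbone_feats: (B, HW, d_model), flattened pretrained backbone features
        B = backbone_feats.size(0)
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)
        decoded = self.decoder(q, backbone_feats)   # (B, num_queries, d_model)
        logits = self.mask_head(decoded)            # (B, num_queries, mask_size**2)
        return logits.view(B, -1, self.mask_size, self.mask_size)
```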

I have a softmax on the output and have tried different learning rates and initializations, but nothing changed the behavior. Has anyone run into a similar collapse, and how did you overcome it in an instance segmentation task? My current training step is sketched below.
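
For reference, this is roughly what my loss computation looks like (heavily simplified; the fixed one-to-one query-to-instance assignment and the exact softmax placement shown here stand in for my real code):

```python
import torch.nn.functional as F

def training_step(model, backbone_feats, gt_masks):
    # gt_masks: (B, num_queries, H, W) binary masks; query i is assumed to
    # correspond to ground-truth instance i (fixed assignment, no matching)
    logits = model(backbone_feats)   # (B, num_queries, H, W)
    B, Q, H, W = logits.shape
    # softmax over the spatial positions of each predicted mask
    probs = F.softmax(logits.view(B, Q, H * W), dim=-1).view(B, Q, H, W)
    return F.binary_cross_entropy(probs, gt_masks.float())
```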

Thanks in advance!