Debugging instance segmentation with a diffusion model (DiffusionInst) and detectron2

I am training a diffusion model built on top of detectron2 on the PubLayNet dataset for instance segmentation. After many iterations, however, the model segments the whole document as a single figure. It doesn’t segment individual elements such as tables and text blocks. The loss decreases steadily at a learning rate of 0.00005: the total loss drops to 1.6, loss_bbox is around 0.200, and loss_giou is 0.3573.
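
One way to put a number on the collapse described above is to count, per image, how many predicted boxes cover nearly the whole page and which classes are predicted. The following is a minimal sketch and not part of the DiffusionInst repo: `summarize_predictions` and the 0.9 area threshold are hypothetical, chosen only for illustration. It works on plain lists, which with detectron2 can be obtained from `outputs["instances"].pred_boxes.tensor.tolist()` and `outputs["instances"].pred_classes.tolist()`.

```python
from collections import Counter


def summarize_predictions(boxes, classes, image_hw, full_page_frac=0.9):
    """Summarize predictions for one image.

    boxes:    list of (x1, y1, x2, y2) in pixels
    classes:  list of integer class ids, same length as boxes
    image_hw: (height, width) of the image the boxes refer to
    full_page_frac: hypothetical threshold; a box covering at least this
        fraction of the image area is counted as "full page"
    """
    height, width = image_hw
    image_area = float(height * width)
    full_page = 0
    for x1, y1, x2, y2 in boxes:
        # Clamp negative extents to zero so degenerate boxes have area 0.
        box_area = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        if box_area / image_area >= full_page_frac:
            full_page += 1
    return {
        "num_boxes": len(boxes),
        "class_counts": dict(Counter(classes)),
        "full_page_boxes": full_page,
    }
```

Comparing this summary across several checkpoints shows whether the full-page predictions are there from the start or develop during training.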

I ran sanity checks, and the input to the model seems to be correct: I visualized the input images and bounding boxes and also checked the targets. I don’t know whether I should train the model further or change other hyperparameters. I tried several learning rates, and this one seems to work well. NUM_PROPOSALS is 500; should I raise it to 1000? Which hyperparameters should I pay particular attention to? I am not building the model from scratch; the code comes from the DiffusionInst repo (GitHub: chenhaoxing/DiffusionInst, the code for the paper "DiffusionInst: Diffusion Model for Instance Segmentation", ICASSP'24). Let me know if anyone has any ideas.
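
Besides visual inspection, the targets can also be checked programmatically. PubLayNet annotations are distributed in COCO format, so a small script can count instances per category and flag boxes that are empty, fall outside the image, or have no segmentation. This is a minimal sketch under that assumption; `check_coco_annotations` is a hypothetical helper, not a function from detectron2 or the DiffusionInst repo.

```python
from collections import Counter


def check_coco_annotations(coco):
    """Return (per-category counts, list of (annotation id, problem)).

    coco: a dict loaded from a COCO-style annotation file, with the usual
    "images", "categories" and "annotations" lists. Boxes are expected in
    COCO's [x, y, width, height] pixel format.
    """
    images = {img["id"]: img for img in coco["images"]}
    names = {cat["id"]: cat["name"] for cat in coco["categories"]}
    counts = Counter()
    problems = []
    for ann in coco["annotations"]:
        counts[names.get(ann["category_id"], "UNKNOWN")] += 1
        img = images.get(ann["image_id"])
        if img is None:
            problems.append((ann["id"], "unknown image_id"))
            continue
        x, y, w, h = ann["bbox"]
        inside = x >= 0 and y >= 0 and x + w <= img["width"] and y + h <= img["height"]
        if w <= 0 or h <= 0 or not inside:
            problems.append((ann["id"], "bbox empty or outside image"))
        if not ann.get("segmentation"):
            problems.append((ann["id"], "missing segmentation"))
    return dict(counts), problems
```

To use it, load the annotation file with `json.load` and pass the resulting dict; the path to that file depends on where the dataset was unpacked. A category count that is heavily skewed, or an `UNKNOWN` entry, would be worth investigating before changing hyperparameters.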

Below is the list of hyperparameters from the config file.

MODEL:
  META_ARCHITECTURE: "DiffusionInst"
  WEIGHTS: "detectron2://ImageNetPretrained/torchvision/R-50.pkl"
  PIXEL_MEAN: [123.675, 116.280, 103.530]
  PIXEL_STD: [58.395, 57.120, 57.375]
  BACKBONE:
    NAME: "build_resnet_fpn_backbone"
  RESNETS:
    OUT_FEATURES: ["res2", "res3", "res4", "res5"]
  FPN:
    IN_FEATURES: ["res2", "res3", "res4", "res5"]
  ROI_HEADS:
    IN_FEATURES: ["p2", "p3", "p4", "p5"]
  ROI_BOX_HEAD:
    POOLER_TYPE: "ROIAlignV2"
    POOLER_RESOLUTION: 7
    POOLER_SAMPLING_RATIO: 2
SOLVER:
  IMS_PER_BATCH: 2
  BASE_LR: 0.0000125
  STEPS: (210000, 250000)
  MAX_ITER: 270000
  WARMUP_FACTOR: 0.01
  WARMUP_ITERS: 1000
  WEIGHT_DECAY: 0.0001
  OPTIMIZER: "ADAMW"
  BACKBONE_MULTIPLIER: 1.0  # keep the backbone learning rate equal to BASE_LR
  CLIP_GRADIENTS:
    ENABLED: True
    CLIP_TYPE: "full_model"
    CLIP_VALUE: 1.0
    NORM_TYPE: 2.0
SEED: 40244023
INPUT:
  MIN_SIZE_TRAIN: (480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800)
  CROP:
    ENABLED: False
    TYPE: "absolute_range"
    SIZE: (384, 600)
  FORMAT: "RGB"
TEST:
  EVAL_PERIOD: 20000
DATALOADER:
  FILTER_EMPTY_ANNOTATIONS: False
  NUM_WORKERS: 2
VERSION: 2
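
To judge whether further training is warranted, it can help to convert the schedule in the config above into epochs. The following is a minimal sketch: `epochs_covered` is a hypothetical helper, and the training-set size is an assumption to be replaced with the real image count from the annotation file (the published PubLayNet training split has roughly 336k images).

```python
def epochs_covered(num_iters, ims_per_batch, num_train_images):
    """Number of passes over the training set after num_iters iterations."""
    if num_train_images <= 0:
        raise ValueError("num_train_images must be positive")
    return num_iters * ims_per_batch / num_train_images


# Values taken from the SOLVER section of the config.
MAX_ITER = 270000
STEPS = (210000, 250000)
IMS_PER_BATCH = 2

# ASSUMPTION: approximate size of the PubLayNet training split;
# replace with the actual count for the data being used.
NUM_TRAIN_IMAGES = 336_000

if __name__ == "__main__":
    total = epochs_covered(MAX_ITER, IMS_PER_BATCH, NUM_TRAIN_IMAGES)
    print(f"full schedule: {total:.2f} epochs")
    for step in STEPS:
        at_step = epochs_covered(step, IMS_PER_BATCH, NUM_TRAIN_IMAGES)
        print(f"LR decay at iteration {step}: after {at_step:.2f} epochs")
```

With a batch size of 2, even the full schedule amounts to only a small number of passes over a dataset of this size, which is worth keeping in mind when comparing checkpoints taken early in training.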