Need help reducing the number of bounding boxes in Mask R-CNN inference

I am working on an object detection project using Mask R-CNN in PyTorch, and I want to export the trained model to ONNX format for deployment in production. However, I am running into problems at inference time: I get bounding boxes all over my video stream. I annotated my videos with CVAT and use only one class; the other is the background.
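For context, this is roughly how I plan to do the export (the frame size is just a placeholder, and I am following the torchvision pattern of passing a list of image tensors):

```python
import torch

model.eval()
dummy = [torch.rand(3, 480, 640)]  # placeholder frame size, not my real resolution
torch.onnx.export(model, (dummy,), "maskrcnn.onnx", opset_version=11)
```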

Training Process:
I use a custom dataset class to load the video frames and their annotations. The data is fed into a Mask R-CNN model with a ResNet-50 backbone, which I train for 15 epochs.
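A simplified sketch of my setup (the dataset/data loader code is omitted, and the hyperparameters are just what I happen to use):

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

num_classes = 2  # one object class + background

model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)

# Replace the box predictor head to match my class count
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# Replace the mask predictor head as well
in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask, 256, num_classes)

optimizer = torch.optim.SGD([p for p in model.parameters() if p.requires_grad],
                            lr=0.005, momentum=0.9, weight_decay=0.0005)

model.train()
for epoch in range(15):
    for images, targets in data_loader:  # data_loader built from my custom dataset (omitted)
        loss_dict = model(images, targets)
        loss = sum(loss_dict.values())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```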

Inference Process:
During inference, the model produces a large number of bounding boxes for every frame of the video, which leads to cluttered visualisation and inaccurate detections. I have tried to filter out redundant detections using IoU thresholding and NMS, but the results were not satisfactory.
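This is roughly what my filtering looks like (the threshold values are ones I experimented with, not anything principled):

```python
import torch
from torchvision.ops import nms

SCORE_THRESH = 0.7  # tried various values by hand
IOU_THRESH = 0.3

model.eval()
with torch.no_grad():
    output = model([frame_tensor])[0]  # frame_tensor: CxHxW float in [0, 1]

# 1) Drop low-confidence detections first
keep = output["scores"] >= SCORE_THRESH
boxes, scores, masks = output["boxes"][keep], output["scores"][keep], output["masks"][keep]

# 2) Then suppress overlapping boxes
keep_idx = nms(boxes, scores, IOU_THRESH)
boxes, scores, masks = boxes[keep_idx], scores[keep_idx], masks[keep_idx]
```

I also noticed that maskrcnn_resnet50_fpn accepts box_score_thresh and box_nms_thresh keyword arguments (box_score_thresh defaults to 0.05). Should I be raising these at model construction instead of filtering afterwards?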

Additional Information:
PyTorch Version: 1.10.1+cu111
CUDA Version: 11.2
OS: Ubuntu 20.04