Inference time around 16s for torchvision.models.detection.fasterrcnn_resnet50_fpn_v2 trained on custom data

Hello,

I’m fairly new to PyTorch and ML in general. I have been able to successfully retrain
a pretrained, modified torchvision.models.detection.fasterrcnn_resnet50_fpn_v2 model using around 13,000 training images and 4,000 validation images.

The application is medical detection of muscle groups in images. Accurate bounding-box localization (high IoU) is important here, because the detections feed a machine-vision image-processing pipeline.

The idea is to export the trained model to ONNX and execute it on an embedded device based on a 4-core Intel CPU.

The model’s results are generally very good: I achieved a final IoU of around 0.92 on my validation set during training.
When running the model from both PyTorch and ONNX on images outside the training and validation data, the accuracy is as expected.

The problem is that inference is extremely slow: around 16 seconds or longer for a single image when using ONNX Runtime (ORT), and much slower than that when running the model from PyTorch.
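For context, I’m timing single-image inference roughly like this (simplified; the random input here just matches the export shape, my real preprocessing differs):

    import time
    import numpy as np
    import onnxruntime as ort

    # Limit intra-op threads to the 4 cores and enable all graph optimizations
    opts = ort.SessionOptions()
    opts.intra_op_num_threads = 4
    opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

    session = ort.InferenceSession("Detector.onnx", sess_options=opts,
                                   providers=["CPUExecutionProvider"])

    # Float32 NCHW input matching the shape the model was exported with
    image = np.random.rand(1, 3, 720, 576).astype(np.float32)

    input_name = session.get_inputs()[0].name
    start = time.perf_counter()
    outputs = session.run(None, {input_name: image})
    print(f"Inference took {time.perf_counter() - start:.2f} s")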

The model was trained on an NVIDIA 3060 GPU with 8 GB of memory, and inference is performed on a CPU. Will that have an effect?

The modifications that were made to the model for training on my images are shown below:

    import torch
    import torchvision
    import torch.optim as optim
    from torch.optim.lr_scheduler import ReduceLROnPlateau

    # Load the pre-trained Faster R-CNN model with ResNet-50-FPN backbone
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn_v2(
        weights=torchvision.models.detection.FasterRCNN_ResNet50_FPN_V2_Weights.DEFAULT
    )

    # Modify the model architecture
    num_classes = 2  # Number of classes, including the background class

    # Get the number of input features for the box predictor
    in_features = model.roi_heads.box_predictor.cls_score.in_features

    # Replace the box predictor with one sized for my classes
    model.roi_heads.box_predictor = torchvision.models.detection.faster_rcnn.FastRCNNPredictor(
        in_features, num_classes
    )

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    # Define the optimizer (learning_rate, betas, weight_decay are set elsewhere)
    optimizer = optim.Adam(model.parameters(), lr=learning_rate, betas=betas, weight_decay=weight_decay)

    # The scheduler reduces the learning rate when validation loss plateaus
    scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=learning_rate_step, patience=3, verbose=True)

....

    # Training complete
    # Save the trained model weights
    torch.save(model.state_dict(), "Detector.pth")

    # Export the model to ONNX format (eval mode, so the inference
    # post-processing path is exported rather than the training path)
    model.eval()
    dummy_input = torch.randn(1, 3, 720, 576).to(device)  # Dummy input matching my image shape
    torch.onnx.export(model, dummy_input, "Detector.onnx", opset_version=14)
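After export, a quick way to sanity-check the graph is something along these lines (a minimal sketch using the onnx package’s structural checker):

    import onnx

    # Load the exported graph and run ONNX's structural validation
    onnx_model = onnx.load("Detector.onnx")
    onnx.checker.check_model(onnx_model)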

I have several ideas of what might be done to improve the model. For instance, I could retrain the entire model from scratch rather than starting from pretrained weights; retraining takes about 7 hours on my GPU.

- I could use quantization-aware training.
- I could try ONNX quantization on the exported model (see the sketch below).
- I don’t know whether torch quantization will survive ONNX export; I have read that it is not supported.
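On the ONNX side, my understanding is that ONNX Runtime ships a post-training dynamic quantization utility, so something along these lines should be possible (untested on my model; the file names match my export above):

    from onnxruntime.quantization import quantize_dynamic, QuantType

    # Post-training dynamic quantization: weights are stored as 8-bit
    # integers; activations are quantized on the fly at inference time
    quantize_dynamic(
        model_input="Detector.onnx",
        model_output="Detector.quant.onnx",
        weight_type=QuantType.QUInt8,
    )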

I could resize my images to 300x300, but I would rather keep them at their original size of 720x576x3 if possible, so adequate detail is maintained.

Does anyone know what can be done to achieve lower inference times? Even 1 second per image would be acceptable for this application.

Thank you

I decided to abandon Faster R-CNN and move to YOLOv5.

I trained a model on the same set of images and exported the trained weights to ONNX; it performs inference at around 250 ms per image, with higher accuracy.

The main drawback is the extra pre/post-processing: rescaling the image input, sorting the output to locate the bounding box with the highest confidence, then mapping that box back to the original image coordinates. But this isn’t too difficult to achieve.
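The unscaling step amounts to something like this (a minimal sketch assuming the standard YOLOv5 letterbox preprocessing; names like ratio and pad are illustrative):

    import numpy as np

    def best_box_to_original(detections, ratio, pad):
        """Pick the highest-confidence detection and map its box back to
        the original image coordinates.

        detections: (N, 6) array of [x1, y1, x2, y2, confidence, class]
                    in letterboxed-input coordinates
        ratio:      scale factor applied during letterboxing
        pad:        (pad_x, pad_y) padding added during letterboxing
        """
        best = detections[np.argmax(detections[:, 4])]
        x1, y1, x2, y2 = best[:4]
        # Undo the padding, then undo the scaling
        x1 = (x1 - pad[0]) / ratio
        y1 = (y1 - pad[1]) / ratio
        x2 = (x2 - pad[0]) / ratio
        y2 = (y2 - pad[1]) / ratio
        return np.array([x1, y1, x2, y2]), best[4]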

I’m working on integrating the new model into my application now. I’m not certain yet, but I believe this inference speed may be fast enough for real-time detection on the raw video stream acquired by my application, rather than capturing a still from the stream first. If so, it will definitely enhance the visual aspect!

Thanks!

Karl.