PyTorch model to ONNX to TensorRT engine = no speed up for inference?

Hey everyone,

I’m working with a Jetson Nano device, TRT 6 (the latest version that can be used on the Nano), PyTorch 1.2.0 (compatible with TRT6), and Torchvision 0.4.0 (compatible with PyTorch 1.2.0).

I have a Torchvision MobileNetV2 model that I exported to ONNX with the built-in function:

    torch.onnx.export(pt_model, dummy_input, out_path, verbose=True)

I then built a TensorRT engine from this ONNX model:

    with trt.Builder(TRT_LOGGER) as builder, \
         builder.create_network(*EXPLICIT_BATCH) as network, \
         trt.OnnxParser(network, TRT_LOGGER) as parser:

        builder.max_workspace_size = 1 << 28
        builder.max_batch_size = 1
        builder.fp16_mode = True
        # ...
        engine = builder.build_cuda_engine(network)

I then run inference with this engine on the Jetson Nano and get a latency of about 0.045 seconds (22.2 fps). Running the PyTorch version of the model gives almost exactly the same latency, 0.045 seconds.

I also tried switching to INT8 mode when building the TensorRT engine, but got the error: Builder failed while configuring INT8 mode.

Anyone have experience with optimizing Torch models with TensorRT? Am I missing something fundamental when building the TensorRT engine or should I expect a speed-up?

Thanks.

I haven’t tried to export Mobilenet yet, but could see a speedup for FasterRCNN.
How are you currently measuring the throughput?
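For reference, GPU work is launched asynchronously, so timing a call without synchronizing (or without a few warm-up runs) can give misleading numbers. Below is a minimal sketch of a timing harness. The names `benchmark`, `summarize`, `run_once`, and `sync` are hypothetical and chosen for illustration, and the workload in the demo is a CPU stand-in rather than a real model.

```python
import statistics
import time


def benchmark(run_once, sync=None, warmup=10, iters=50):
    """Time run_once() and return a list of per-call latencies in seconds.

    sync is an optional callable that blocks until queued device work has
    finished; when it is None, no synchronization is performed.
    """
    sync = sync or (lambda: None)

    # Warm-up runs are excluded from the measurement.
    for _ in range(warmup):
        run_once()
    sync()

    latencies = []
    for _ in range(iters):
        start = time.perf_counter()
        run_once()
        sync()  # wait for the device before stopping the clock
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Return the median latency in seconds and the matching frames per second."""
    median = statistics.median(latencies)
    return {"median_s": median, "fps": 1.0 / median}


if __name__ == "__main__":
    # Stand-in CPU workload; replace with a real inference call.
    print(summarize(benchmark(lambda: sum(range(10_000)))))
```

With PyTorch, something like `benchmark(lambda: model(x), sync=torch.cuda.synchronize)` would be the intended use, assuming `model` and `x` already live on the GPU. For a TensorRT engine driven through PyCUDA, the stream's `synchronize()` method would play the same role.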

Thanks for this thread. I am in the same boat, trying to figure out how to optimize networks for the Jetson. @jdev Have you tried the INT8 quantization pipeline, first calibrating on images and then running inference in INT8? I believe you will run into difficulties with operations that are not supported.

Nvidia’s retinanet-examples repo gives a working pipeline but it’s obviously much more useful to have this going for our own custom networks.

Reviving this topic, as I have come to a similar conclusion. I have been testing a variety of things, and for a ResNet50 network with a large input (a 3 MP image) on an RTX 3070, a TensorRT engine in FP32 is actually slower than cuDNN inference in PyTorch with JIT trace + CUDA AMP:

Time for total prediction pytorch JIT = 0.06634163856506348
Time for total prediction trt = 0.07124924659729004

It is the same model, exported from PyTorch to ONNX and then converted from ONNX to TensorRT. I have also tested torch2trt and TRTorch; so far only TRTorch appears to show a small speed gain, but both are still alpha projects, with some problems where the output does not match the original model.

How are you dealing with the large memory usage of such a large image?

That’s a different problem, and indeed it can be tricky. It is fine for the purpose of the test, but I can’t work with too many pictures at the same time: each image occupies roughly 2 GB.