PyTorch C++ Deployment Story: 2019

Hi PyTorch team,

What is the recommended approach for deploying Python-trained models to a high-performance C++ runtime (ideally supporting accelerators) as of October 2019?

There seem to be many approaches right now and I’m confused as to:

  1. What is the best way right now?
  2. What will be the best way in 6-12 months? (I.e. where are we headed?)

Use case: We need the flexibility of Python for training but the performance of TensorRT (ideally) for deployment in a C++ environment.

  1. What is the ultimate goal for torchscript/libtorch? Is it just for converting into c++? Or will it also provide direct support for hardware acceleration?
  2. Will torchscript/libtorch replace the need for Caffe2? How does ONNX fit into this?
  3. What is the recommended optimized runtime format? (Onnxruntime, glow, tensorrt, tvm)? Will your recommendation change in 6-12 months?
  4. Do we have any performance benchmarks for each of these formats? Python vs Libtorch? Libtorch vs Glow?

Apologies for the questions. Perhaps someone can write a summary article to cover this topic?

Thanks in advance!


Hi Brendan! Thanks for the question, it’s a good opportunity for us to provide some clarity in this area.

The PyTorch team is betting heavily on TorchScript/libtorch as the path for going from research to production. Our ideal workflow is for the user to prototype in Python/PyTorch eager, convert to TorchScript, then use our compiler infrastructure to optimize and potentially lower your model to specialized hardware.

You can check out our tutorials on TorchScript and exporting your model to C++ and our TorchScript reference for more information on using TorchScript to deploy your PyTorch models to production.
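As a minimal sketch of the first two steps of that workflow (prototype in eager mode, then convert to TorchScript and serialize), something like the following should work. `TinyNet` and the file name `tiny_net.pt` are hypothetical stand-ins, not part of any tutorial; only `torch.jit.script`, `save`, and `torch.jit.load` are the actual APIs.

```python
import os
import tempfile

import torch
from torch import nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for a model prototyped in eager mode."""

    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(4, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.fc(x))


model = TinyNet().eval()

# Compile the module to TorchScript and serialize it to disk.
scripted = torch.jit.script(model)
path = os.path.join(tempfile.mkdtemp(), "tiny_net.pt")
scripted.save(path)

# Reload the serialized module; no Python class definition is needed here.
reloaded = torch.jit.load(path)

x = torch.randn(3, 4)
with torch.no_grad():
    eager_out = model(x)
    reloaded_out = reloaded(x)
```

The saved file is the artifact that a C++ program would load through libtorch's `torch::jit::load`; the Python reload above is only a quick check that the serialized module reproduces the eager output.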

Today, the basic building blocks of that workflow are in place, but the extension point for hardware backends is the thing that we need to work on the most.

The (as of today) best approach to add hardware or compiler backends to our JIT is to replicate what we have in the pytorch/tvm repo. @bwasti has also written up a tutorial for the same integration strategy. It registers certain PyTorch operators as TVM-accelerated and the JIT offloads subgraphs with these operators to the TVM backend. Happy to answer any questions about that.
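To make the offloading idea concrete, here is a toy sketch in plain Python. It is not the pytorch/tvm code and does not touch the real JIT graph; the op names, `SUPPORTED_OPS`, and `partition` are all hypothetical. It only illustrates the general idea of grouping consecutive backend-supported operators into subgraphs and leaving everything else to the default runtime.

```python
from itertools import groupby

# Hypothetical set of operators that a backend has registered as accelerated.
SUPPORTED_OPS = {"conv2d", "relu", "add"}


def partition(ops, supported=SUPPORTED_OPS):
    """Split a linear op sequence into (target, ops) segments.

    Consecutive supported ops are grouped into one "backend" segment;
    everything else stays on the default "jit" runtime.
    """
    segments = []
    for is_supported, run in groupby(ops, key=lambda op: op in supported):
        target = "backend" if is_supported else "jit"
        segments.append((target, list(run)))
    return segments


# A made-up op sequence standing in for a model graph.
graph = ["conv2d", "relu", "topk", "conv2d", "add", "nms"]
segments = partition(graph)
for target, ops in segments:
    print(target, ops)
```

A real graph is a DAG rather than a flat list, so an actual integration also has to respect data dependencies when fusing operators; the flat list here is a deliberate simplification.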

If you don’t need graphs to be built at runtime (say, you have a ResNet-ish trunk to your model that is highly stable and you want to guarantee it is compiled), you can compile in TVM/TensorRT/Glow/etc. directly, then just call that as a custom op in your model. For example, the Torch2TRT Converter can convert ResNet-ish trunks into TRT; your network is then partly a Python function that calls the TRT model, with the rest remaining a PyTorch-native model.

Going forward, we are looking at two directions:

  1. Improvements to the optimization and code generation capabilities of PyTorch’s native JIT runtime. We haven’t focused much here to date (busy writing TorchScript itself) but we are investing in this much more as TorchScript matures.
  2. A simple way to say “export this nn.Module to X graph compiler”, with an interface similar to .to() that works only on nn.Modules and not tensors, as well as the ability to use such compiled modules in TorchScript.

Combined, these two things will make the performance story a lot clearer for PyTorch. If you are just hoping for performance improvements without doing any work, the native JIT runtime should be “good enough”. If you are really trying to squeeze out performance by tuning your model to work well with a graph compiler (say TensorRT), you should be able to imperatively tell TorchScript “convert this module to a TensorRT graph or fail”, and backend vendors can implement the conversions as they see fit.

As for your specific questions:

What is the ultimate goal for torchscript/libtorch? Is it just for converting into c++? Or will it also provide direct support for hardware acceleration?

We plan to provide direct support for hardware acceleration using TorchScript, in the manner described above.

Will torchscript/libtorch replace the need for Caffe2? How does ONNX fit into this?

TorchScript is intended as a replacement for PyTorch -> ONNX -> Caffe2 conversion. We think the experience is overall better, as we can precisely preserve the semantics of your model code and you don’t have to work with two separate frameworks.

That said, we will continue to support a PyTorch → ONNX conversion step, as it is a standard that we want PyTorch as a framework to interoperate with well. ONNX covers CoreML, ONNX.js etc. which are pretty useful (from what we heard).

What is the recommended optimized runtime format? (Onnxruntime, glow, tensorrt, tvm)? Will your recommendation change in 6-12 months?

We don’t have a recommended backend. They all have different strengths and weaknesses, so it will depend heavily on your use case and hardware which is right for you.

In 6–12 months, I think that recommendation will be the same, prefixed by “you should really consider using just the native JIT optimization; perhaps it will be good enough that you can avoid the integration cost of a secondary backend.”

Do we have any performance benchmarks for each of these formats? Python vs Libtorch? Libtorch vs Glow?

We do not have comprehensive benchmarks between backends, no. Generally, it is somewhat difficult as graph compilers tend to be sensitive to changes in the model, so it’s hard to get a “fair” comparison everyone is happy with.

As for Python vs. TorchScript/JIT: to set expectations, we generally tell people “it should be about the same speed”, as we have not really started turning the optimization crank yet. As mentioned above, this is an area we are beginning to invest in quite heavily, so this should improve soon. We have done some small work so far and wrote about it here; it shows what this will generally look like as we do more.


Thank you so much for your reply! I’m continually surprised and delighted by the responsiveness of the PyTorch team and community, even as the project matures! It’s a great time to be a PyTorch user :slight_smile:.

So if I were to summarize, the future for PyTorch deployment would look like this:

  • Users prototype models in Python, annotate them with TorchScript, and get a high-performance C++ runtime out of the box thanks to JIT
  • As JIT matures, PyTorch developers will increasingly focus on performance and try to close the gap with existing backends (TensorRT, TVM, Glow)
  • JIT performance will ideally be so good, users won’t have to integrate specialized backends, except in rare cases where they need that last drop of speed
  • If users DO need to integrate with other backends, PyTorch will make it easy to incorporate them, enabling the use of different compilers for different submodules in a graph (e.g. pytorch/tvm repo)
  • PyTorch will support ONNX for interoperability, but ideally JIT will make this unnecessary for use cases where performance is the goal

Does that look right? And to summarize the current state as of October 2019:

  • JIT performance is sufficient for many use cases, but not yet on par with other compilers (TVM, Glow, TensorRT)
  • TorchScript works well for simple architectures, but there is work needed to support more complex models like object detectors and LSTMs
  • For those of us training object detectors, we have 2 options:
    – Change our python code to be compatible with the ops currently supported by TorchScript
    – Implement a custom TorchScript operation (docs)
  • If JIT performance isn’t sufficient and we need to squeeze the last drop of performance, the recommended approach is to add a custom compiler yourself (e.g. pytorch/tvm)
  • Another way to increase performance, for situations where you have a stable backbone, is to first compile the backbone with TVM/TensorRT, and then write a custom TorchScript op in your model (e.g. https://tutorials.pytorch.kr/advanced/torch_script_custom_ops.html <-- I think?)
  • A final way to increase performance (but not recommended), is to convert to ONNX, and then compile with a supported backend from there

And just to summarize my use case and try to answer my own question:

Use case

  • Rapidly prototype SOTA object detectors in Python
  • Deploy to a C++ environment with strict latency requirements

Recommendation

  • Update my python code to be compatible with TorchScript (example here) or write custom TorchScript ops
  • Add libtorch as a dependency in my stack and benchmark performance against current models
  • If performance is not sufficient, try porting the backbone to TensorRT and add a custom op in TorchScript
  • If performance is still not sufficient, write a custom compiler for TensorRT/TVM to compile specific ops (e.g. pytorch/tvm)
  • If all else fails, try converting to ONNX and explore converting to TensorRT/TVM from there (or try ONNXruntime)
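The recommendation above is essentially a fallback chain: benchmark each deployment option in order and stop at the first one that meets the latency budget. A rough sketch in plain Python follows; `median_latency_ms`, `pick_first_within_budget`, and the candidate callables are hypothetical names, and each callable stands in for one full inference call.

```python
import statistics
import time


def median_latency_ms(fn, warmup=3, iters=20):
    """Return the median wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are not timed
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def pick_first_within_budget(candidates, budget_ms):
    """Try (name, fn) pairs in order; return the first that meets the budget.

    Returns (name, latency_ms), or None if no candidate is fast enough.
    """
    for name, fn in candidates:
        latency = median_latency_ms(fn)
        if latency <= budget_ms:
            return name, latency
    return None


# Stand-in callables; in practice each would run one inference.
candidates = [
    ("libtorch_only", lambda: time.sleep(0.02)),  # pretend: too slow
    ("trt_backbone_plus_jit", lambda: None),      # pretend: fast enough
]
choice = pick_first_within_budget(candidates, budget_ms=5.0)
print(choice)
```

For GPU models, each timed call would also need to wait for the device to finish (for example with `torch.cuda.synchronize()`), otherwise the measurement only captures kernel launch time.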

Thanks again! Let me know if I’m missing anything!


Great summary! Thanks for your interest in PyTorch :slight_smile:

To give some context to this bullet:

  • TorchScript works well for simple architectures, but there is work needed to support more complex models like object detectors and LSTMs

We have confidence that TorchScript is expressive enough to support complex models, but you’re correct that some legwork is involved. We showcased some work with LSTMs on our blog, and in the next version of Torchvision I believe all the models (including object detection models like MaskRCNN) will be TorchScript compatible.

As for your use case specifically, we have reason to believe PyTorch is a good fit; Uber ATG’s perception and prediction network uses a pure-TorchScript implementation, which (for obvious reasons :stuck_out_tongue:) has very similar requirements to yours. @sidazhang gave an amazing talk at our devcon about their experience using PyTorch for their end-to-end research to production.

The steps you outlined for moving forward seem like the right ones overall. I would add only that in between “check libtorch performance” and “try TensorRT if it’s not fast enough”, there are a number of ways to tune your PyTorch model for performance. Would be happy to discuss further when you get there!


Hi @bfortuner ,

Thanks for this awesome post. I have a question about the approach of first converting the model to ONNX and then compiling with a supported backend, which you described as “not recommended”. Could you explain why this approach is not recommended?

This tutorial link seems broken…

The link is working for me; you might need to refresh

Thanks! Now I can visit the link :slight_smile:

Yeah, I’ll let @Michael_Suo elaborate if he wants, but in my understanding:

  1. Converting models between 2 frameworks is hard. Each framework needs to support the same operations and the conversion process introduces opportunities for bugs
  2. Converting models between 3 frameworks (PT --> ONNX --> TrT) introduces more opportunities for failures or incompatibilities
  3. PyTorch core developers are focused on JIT and will not invest significant effort in ONNX

Today, based on my reading and testing, going from PyTorch --> ONNX --> TensorRT will be easier for some use cases, but I think that will change in the future, and the community should invest in JIT --> TensorRT conversion directly.

The torch2trt repo below works well for simple / popular models like ResNet, which makes it possible to deploy hybrid models of JIT + TrT (most object detectors will work like this). But it’s tracing-based, unlike ONNX --> TrT which does direct graph parsing, so doing a proper port of a complex architecture would require some surgery.
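To show why a tracing-based conversion can need surgery on complex architectures, here is a small sketch using `torch.jit.trace` and `torch.jit.script`. It does not use torch2trt itself and says nothing about its internals; it only illustrates the general limitation that tracing records the operations executed for one example input, so data-dependent control flow is frozen. The function `shift_or_double` is a made-up example.

```python
import torch


def shift_or_double(x: torch.Tensor) -> torch.Tensor:
    # Data-dependent branch: which path runs depends on the input values.
    if x.sum() > 0:
        return x * 2
    return x + 10


# Tracing records only the branch taken for the example input (x * 2).
# PyTorch emits a TracerWarning here, which is expected.
traced = torch.jit.trace(shift_or_double, torch.ones(3))

# Scripting compiles the source, so both branches are preserved.
scripted = torch.jit.script(shift_or_double)

neg = -torch.ones(3)
eager_out = shift_or_double(neg)   # takes the x + 10 branch
traced_out = traced(neg)           # still computes x * 2
scripted_out = scripted(neg)       # takes the x + 10 branch
```

Detection heads often contain this kind of input-dependent logic, which fits the observation that a fixed backbone converts cleanly while the rest of the model stays in JIT.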

Anyway, I hope the PyTorch/Nvidia developers will consider collaborating on the Torch2Trt repo to make it better in the future. In my performance testing, TensorRT is at least 2x faster than raw JIT for architectures like ResNet (I don’t see any speedups for JIT over raw PyTorch for any architecture, except a tiny benefit from the C++ runtime). However, the hybrid models (ResNet backbone in TrT, object detector head in JIT) can close that gap considerably.

I’d like to see PyTorch developers invest more in JIT performance, interoperability with 3P accelerators, detailed documentation and examples of how to squeeze out model performance with JIT, and provide some benchmarks on the various combinations.


I think @bfortuner’s summary is accurate.

Luckily, this looks almost exactly like what we’re planning to work on over the next year :stuck_out_tongue:

cc @dzhulgakov @ngimel as this has relevance for your “performance as a product” stuff.

@Michael_Suo MKLDNN being disabled on Windows right now means libtorch is currently a no-go for us Windows users. I’m considering a mix of the OpenCV inference engine on CPU and TensorRT on GPU.
Is that a stupid idea, and should I just wait a few weeks for libtorch to be updated?