PyTorch C++ Deployment Story: 2019

Hi Brendan! Thanks for the question; it’s a good opportunity for us to provide some clarity in this area.

The PyTorch team is betting heavily on TorchScript/libtorch as the path for going from research to production. Our ideal workflow is for you to prototype in Python/PyTorch eager mode, convert your model to TorchScript, then use our compiler infrastructure to optimize it and potentially lower it to specialized hardware.
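To make that concrete, here is a minimal sketch of the Python side of that workflow (the model and shapes are placeholders; tracing and scripting are both options for the conversion step):

```python
import torch
import torch.nn as nn

# Placeholder module standing in for your real research model.
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(16, 4)

    def forward(self, x):
        return torch.relu(self.fc(x))

model = MyModel().eval()

# Convert to TorchScript by tracing with a representative input
# (torch.jit.script is the alternative when you need data-dependent control flow).
traced = torch.jit.trace(model, torch.randn(1, 16))

# Serialize for deployment; the archive can be loaded without Python,
# e.g. from C++ via torch::jit::load in libtorch.
traced.save("my_model.pt")
```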

You can check out our tutorials on TorchScript and on exporting your model to C++, as well as our TorchScript reference, for more information on using TorchScript to deploy your PyTorch models to production.

Today, the basic building blocks of that workflow are in place, but the extension point for hardware backends is the area that needs the most work.

Today, the best approach for adding a hardware or compiler backend to our JIT is to replicate what we have in the pytorch/tvm repo. @bwasti has also written up a tutorial covering the same integration strategy. The integration registers certain PyTorch operators as TVM-accelerated, and the JIT offloads subgraphs containing those operators to the TVM backend. Happy to answer any questions about that.
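In rough terms, the user-facing side of that integration looks like the sketch below (see the pytorch/tvm README for the authoritative version; the import and enable call are that repo’s API, not core PyTorch):

```python
import torch
import torch_tvm  # provided by the pytorch/tvm repo, not core PyTorch

# Register the TVM-backed operators with the JIT; after this call the JIT
# can offload supported subgraphs to TVM when it executes TorchScript code.
torch_tvm.enable()

# Subgraphs of this scripted function made up of supported operators
# may be compiled and run by TVM on execution.
@torch.jit.script
def add_mul(a, b, c):
    return a * b + c

x = torch.randn(8, 8)
print(add_mul(x, x, x))
```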

If you don’t need graphs to be built at runtime (say, you have a ResNet-ish trunk to your model that is highly stable and you want to guarantee it is compiled), you can compile it in TVM/TensorRT/Glow/etc. directly, then call the result as a custom op in your model. For example, the Torch2TRT converter can convert ResNet-ish trunks to TensorRT; your network is then partly a Python function that calls the TRT engine and partly the PyTorch-native rest of the model.
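As a rough sketch of that hybrid setup (assuming the NVIDIA torch2trt package and a torchvision trunk; check the torch2trt README for the exact conversion API):

```python
import torch
import torch.nn as nn
import torchvision
from torch2trt import torch2trt  # NVIDIA's Torch2TRT converter

# Stable ResNet trunk: compile it ahead of time with TensorRT.
trunk = torchvision.models.resnet18(pretrained=True).eval().cuda()
example = torch.randn(1, 3, 224, 224).cuda()
trunk_trt = torch2trt(trunk, [example])  # module-like wrapper around a TRT engine

# The rest of the network stays as ordinary PyTorch and simply calls the
# compiled trunk like any other module.
class HybridModel(nn.Module):
    def __init__(self, trunk, num_classes=10):
        super().__init__()
        self.trunk = trunk
        self.head = nn.Linear(1000, num_classes)

    def forward(self, x):
        features = self.trunk(x)    # runs inside TensorRT
        return self.head(features)  # runs in native PyTorch

model = HybridModel(trunk_trt).cuda().eval()
out = model(example)
```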

Going forward, we are looking at two directions:

  1. Improvements to the optimization and code generation capabilities of PyTorch’s native JIT runtime. We haven’t focused much here to date (we’ve been busy writing TorchScript itself), but we are investing in this much more as TorchScript matures.
  2. A simple way to say “export this nn.Module to X graph compiler”, with an interface similar to .to() but that works only on nn.Modules and not tensors, as well as the ability to use such compiled modules in TorchScript.

Combined, these two things will make the performance story a lot clearer for PyTorch. If you are just hoping for performance improvements without doing any extra work, the native JIT runtime should be “good enough”. If you are really trying to squeeze out performance by tuning your model to work well with a graph compiler (say TensorRT), you should be able to imperatively tell TorchScript “convert this module to a TensorRT graph or fail”, and backend vendors can implement the conversions as they see fit.

As for your specific questions:

What is the ultimate goal for TorchScript/libtorch? Is it just for converting into C++? Or will it also provide direct support for hardware acceleration?

We plan to provide direct support for hardware acceleration using TorchScript, in the manner described above.

Will torchscript/libtorch replace the need for Caffe2? How does ONNX fit into this?

TorchScript is intended as a replacement for PyTorch → ONNX → Caffe2 conversion. We think the experience is overall better, as we can precisely preserve the semantics of your model code and you don’t have to work with two separate frameworks.

That said, we will continue to support a PyTorch → ONNX conversion step, as it is a standard that we want PyTorch as a framework to interoperate with well. ONNX covers CoreML, ONNX.js etc. which are pretty useful (from what we heard).
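For reference, that conversion step is the standard torch.onnx.export call, e.g.:

```python
import torch
import torchvision

# Export a model to ONNX; the resulting file can then be consumed by
# ONNX-compatible runtimes and converters (CoreML, ONNX.js, etc.).
model = torchvision.models.resnet18(pretrained=True).eval()
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "resnet18.onnx")
```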

What is the recommended optimized runtime format (ONNX Runtime, Glow, TensorRT, TVM)? Will your recommendation change in 6–12 months?

We don’t have a recommended backend. They all have different strengths and weaknesses, so which one is right for you will depend heavily on your use case and hardware.

In 6–12 months, I think that recommendation will be the same, prefixed by “you should really consider just using the native JIT optimizations; they may be good enough that you can avoid the integration cost of a secondary backend”.

Do we have any performance benchmarks for each of these formats? Python vs Libtorch? Libtorch vs Glow?

We do not have comprehensive benchmarks between backends, no. It is generally somewhat difficult because graph compilers tend to be sensitive to changes in the model, so it’s hard to get a “fair” comparison that everyone is happy with.

As for Python vs. TorchScript/JIT: to set expectations, we generally tell people “it should be about the same speed”, as we have not really started turning the optimization crank yet. As mentioned above, this is an area we are beginning to invest in quite heavily, so it should improve soon. We have done some initial work and written about it here, which should give you a sense of what this will generally look like as we do more.
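If you want a rough sense on your own model today, an informal comparison is easy to set up yourself (this is illustration only, not a rigorous benchmark; numbers will vary widely with model, hardware, and build):

```python
import time
import torch
import torchvision

model = torchvision.models.resnet18(pretrained=True).eval()
x = torch.randn(1, 3, 224, 224)
traced = torch.jit.trace(model, x)

def bench(fn, iters=50):
    # Warm-up runs (also gives the JIT a chance to optimize on first executions).
    for _ in range(5):
        fn(x)
    start = time.time()
    for _ in range(iters):
        fn(x)
    return (time.time() - start) / iters

with torch.no_grad():
    print("eager:       %.4f s/iter" % bench(model))
    print("torchscript: %.4f s/iter" % bench(traced))
```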
