Now that I have a PyTorch build with the latest TensorRT working, I'm pondering something.
VoltaML uses TensorRT to achieve a huge speedup by 'accelerating' the model (ckpt file) into three plan files, which can then be used for inference. When I got VoltaML working, I went from 39.5 it/s to 88 it/s.
model = torch.compile(model, backend="tensorrt")
seems to work, but I see no sign of any speedup. Also, PyTorch 2.0 markets itself as offering little to no performance improvement for inference. Why would a TensorRT model compiled by torch not show the same kind of speedup I see with someone else's TensorRT-compiled version of the same Stable Diffusion model?
I don't know exactly how VoltaML creates the computation graph or what is provided to TensorRT, but in general you would see the largest speedup if TensorRT is allowed to optimize the entire graph (or as much of it as it can).
Different backends could break the computation graph into subgraphs, which could decrease the potential speedup you could achieve.
CC @narendasan who might know more details.
I should have mentioned that VoltaML takes 20 minutes for its one-time generation of the saved plan. Also, after I posted this I found something called "Torch-TensorRT", which is its own package and claims up to 6x faster inference. It was announced on Dec 2, 2021 on NVIDIA's own dev blog:
Accelerating Inference Up to 6x Faster in PyTorch with Torch-TensorRT
@ptrblck is right, it really comes down to how the graph is broken up. I don't know how VoltaML splits up the graph or whether it uses a third-party IR like ONNX, but for Torch-TensorRT it comes down to support for converting operations into TRT ops. Ops that can't be converted are left to run in PyTorch. There may also be a cost associated with switching between PyTorch and TensorRT execution many times.