There is substantial interest in using PyTorch for real-time deployments, e.g. in audio and robotics. To my knowledge the standard way to do this is the TorchScript / libtorch C++ API. However, the JIT interpreter causes real-time violations through memory allocation and thread synchronization (see "ANIRA: An Architecture for Neural Network Inference in Real-Time Audio Applications", IEEE).
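For context, the workflow I have in mind is the usual one: script the module in Python, serialize it, and load it from the libtorch C++ API on the audio/control thread. A minimal sketch of that baseline (the module and file names here are mine, purely for illustration):

```python
import torch

class Gain(torch.nn.Module):
    """Toy processing module standing in for a real model."""
    def __init__(self, gain: float = 0.5):
        super().__init__()
        self.gain = gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gain

# Script and serialize for loading via torch::jit::load() in libtorch.
scripted = torch.jit.script(Gain())
scripted.save("gain.pt")

# Even after warm-up calls, the JIT interpreter may still allocate
# and synchronize on later invocations, which is the problem here.
scripted(torch.zeros(1, 512))
```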
I recall there was work on a "Static Runtime" backend that would avoid most memory allocation when running the same module repeatedly, but that effort appears to have ceased, and torch.jit itself now seems to have been abandoned in favor of torch.compile, torch.export, AOTInductor, and/or ExecuTorch (?)
At the moment it's very hard to understand which APIs to use and what the intended tradeoffs are for low-latency applications, since most examples discuss vision or language modeling.
In particular, which of these technologies can be real-time friendly with respect to memory allocation? And which support internal mutable state the way torch.jit.ScriptModule does (see the sketch below for what I mean)? Can someone knowledgeable comment on the near-future outlook for low-latency real-time inference?
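To make the mutable-state requirement concrete, here is the kind of thing torch.jit.script handles today: a module that carries filter state across calls as a registered buffer mutated in `forward()`. This is a minimal sketch of my own, not from any official example:

```python
import torch

class OnePole(torch.nn.Module):
    """One-pole lowpass whose filter state must persist across audio blocks."""
    def __init__(self, coeff: float = 0.99):
        super().__init__()
        self.coeff = coeff
        # Mutable internal state: the previous output sample.
        self.register_buffer("y1", torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(x)
        y1 = self.y1
        for n in range(x.size(0)):
            y1 = self.coeff * y1 + (1.0 - self.coeff) * x[n : n + 1]
            out[n : n + 1] = y1
        # In-place write so the state survives to the next call.
        self.y1.copy_(y1)
        return out

scripted = torch.jit.script(OnePole())
block1 = scripted(torch.randn(64))
block2 = scripted(torch.randn(64))  # continues from block1's final state
```

It's unclear to me whether torch.export / AOTInductor can express this kind of cross-call buffer mutation, which is why I'm asking.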