There is substantial interest in using PyTorch for real-time deployments, e.g. in audio and robotics. To my knowledge the standard way to do this is the TorchScript / libtorch C++ API. However, the JIT interpreter causes real-time violations through memory allocation and thread synchronization (see "ANIRA: An Architecture for Neural Network Inference in Real-Time Audio Applications", IEEE).
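For context, the workflow I have in mind is the usual one: script the module in Python, serialize it, and load it from the libtorch C++ API on the audio/control thread. A minimal sketch of that baseline (the module and file names here are mine, purely for illustration):

```python
import torch

class Gain(torch.nn.Module):
    """Toy processing module standing in for a real model."""
    def __init__(self, gain: float = 0.5):
        super().__init__()
        self.gain = gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gain

# Script and serialize for loading via torch::jit::load() in libtorch.
scripted = torch.jit.script(Gain())
scripted.save("gain.pt")

# Even after warm-up calls, the JIT interpreter may still allocate
# and synchronize on later invocations, which is the problem here.
scripted(torch.zeros(1, 512))
```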
I recall there was work on a "Static Runtime" backend that would avoid most memory allocation when running the same module repeatedly, but that effort appears to have ceased, and torch.jit itself now seems to have been abandoned in favor of torch.compile, torch.export, AOTInductor, and/or ExecuTorch (?)
At the moment it's very hard to understand which APIs to use and what the intended tradeoffs are for low-latency applications, since most examples discuss vision or language modeling.
In particular, which of these technologies can be real-time friendly with respect to memory allocation? And which support internal mutable state the way torch.jit.ScriptModule does (see the sketch below for what I mean)? Can someone knowledgeable comment on the near-future outlook for low-latency real-time inference?
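To make the mutable-state requirement concrete, here is the kind of thing torch.jit.script handles today: a module that carries filter state across calls as a registered buffer mutated in `forward()`. This is a minimal sketch of my own, not from any official example:

```python
import torch

class OnePole(torch.nn.Module):
    """One-pole lowpass whose filter state must persist across audio blocks."""
    def __init__(self, coeff: float = 0.99):
        super().__init__()
        self.coeff = coeff
        # Mutable internal state: the previous output sample.
        self.register_buffer("y1", torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = torch.empty_like(x)
        y1 = self.y1
        for n in range(x.size(0)):
            y1 = self.coeff * y1 + (1.0 - self.coeff) * x[n : n + 1]
            out[n : n + 1] = y1
        # In-place write so the state survives to the next call.
        self.y1.copy_(y1)
        return out

scripted = torch.jit.script(OnePole())
block1 = scripted(torch.randn(64))
block2 = scripted(torch.randn(64))  # continues from block1's final state
```

It's unclear to me whether torch.export / AOTInductor can express this kind of cross-call buffer mutation, which is why I'm asking.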