Backend-agnostic serialization / inference workflow

Is there currently a workflow for exporting and serializing an nn.Module with torch.export so that it can be loaded and run for inference on all of the available ExecuTorch backends (as is possible with TorchScript)? Or do I need one artifact per target backend?

We require one artifact per backend. Do you have a use case that would benefit from specializing to a backend dynamically?

Yes, CPU fallback. We deploy a wide range of models (currently TorchScript and ONNX) to users, and some of them need to fall back to CPU because their hardware is not sufficient (this is usually only detected at runtime).

Thanks. We have been exploring options for multi-delegation / CPU fallback, though nothing concrete yet. Most use cases currently use a model distribution system to download the specialized model file depending on the runtime capabilities.
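
For illustration, the "pick the specialized file based on runtime capabilities" approach could look roughly like the sketch below. This is only a minimal sketch, not an ExecuTorch API: the manifest, the backend labels, the file names, and the select_artifact helper are all hypothetical, and how the available backends are detected is left to the application.

```python
from typing import Dict, Iterable, Optional, Sequence

# Hypothetical manifest: backend label -> artifact file name.
# One artifact is exported per backend ahead of time.
MANIFEST: Dict[str, str] = {
    "cuda": "model_cuda.pte",
    "xnnpack": "model_xnnpack.pte",  # CPU artifact, used as the fallback
}


def select_artifact(
    detected_backends: Iterable[str],
    manifest: Dict[str, str],
    preference: Sequence[str] = ("cuda", "xnnpack"),
) -> Optional[str]:
    """Return the artifact for the most preferred backend that is both
    detected on the device and present in the manifest, or None."""
    detected = set(detected_backends)
    for backend in preference:
        if backend in detected and backend in manifest:
            return manifest[backend]
    return None


# A device with only CPU support falls through to the CPU artifact.
chosen = select_artifact({"xnnpack"}, MANIFEST)
```

The distribution system would then download only the chosen file; a None result means no suitable artifact exists for that device.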

I see! That might be the best option. Just one more question: will the CUDA backend be hardware agnostic, so that models exported for one CUDA arch still run on other archs? Or does it need one artifact per arch?

I’ll tag one of the CUDA experts here - @larryliu0820. Do you know the answer to the above? Thanks.

Yeah, right now what we are doing is basically what AOTInductor supports. Ideally we should be able to have a .ptx-based solution that works across different CUDA archs, but we don’t have that right now. So yeah, currently it is 1 artifact per arch. If you need support across architectures, please create an issue in pytorch/executorch.
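
Until an arch-agnostic artifact exists, one way to handle "1 artifact per arch" on the distribution side is to key the CUDA artifacts by compute capability and treat a missing entry as "use CPU instead". The snippet below is a rough sketch under that assumption; the file names, the CUDA_ARTIFACTS table, and the artifact_for_arch helper are hypothetical, and obtaining the device's (major, minor) compute capability is left to the caller.

```python
from typing import Dict, Optional

# Hypothetical table: CUDA arch label -> artifact exported for that arch.
CUDA_ARTIFACTS: Dict[str, str] = {
    "sm_80": "model_sm80.pte",
    "sm_90": "model_sm90.pte",
}


def artifact_for_arch(
    major: int, minor: int, artifacts: Dict[str, str]
) -> Optional[str]:
    """Return the artifact built for exactly this compute capability.

    Only exact matches are accepted, since an artifact exported for one
    arch is not assumed to run on another. None tells the caller to fall
    back to a different backend (for example a CPU artifact).
    """
    return artifacts.get(f"sm_{major}{minor}")


# A device with compute capability 7.5 has no matching artifact here.
missing = artifact_for_arch(7, 5, CUDA_ARTIFACTS)
```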

Thank you for the answer. Done: "Architecture Agnostic CUDA artifact", Issue #17666 in pytorch/executorch.