Tracing-based selective build for CUDA kernels

Hey, I’m trying to reduce the size of my PyTorch installation for running inference. I have a limited set of models that I’d like to run. I saw that there is a way to trace models to create a selective build for CPU (see “PyTorch’s Tracing Based Selective Build” on the PyTorch site). Would it be possible to do something similar for CUDA, reducing the size of the build by only including the kernels that these models actually use?

Good timing! A related discussion was posted today that you can support: “Enable Link Time Optimization in PyTorch 2.0 Release Binaries - Smaller, Faster, Better Binaries” (issue #93955 in pytorch/pytorch on GitHub).

