How can I make a smaller version of libtorch for deployment?

I want to deploy a standalone executable for a variety of Linux execution environments, using CUDA-enabled libtorch. To facilitate this, I want to package everything together in as small a binary as possible. One issue, aside from the CUDA libraries themselves, is libtorch_cuda.so. When I use nvprune to shrink it by removing some older architectures, it complains that the .so file is not relocatable and fails. Are there any easy ways to drop certain archs from libtorch_cuda.so? Strip kernels that I don't use in practice? Statically link against my own pruned versions of CUDA libraries like libcublas?

Any thoughts on this, or other tricks or ideas for reducing size?
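For reference, here is roughly how I've been checking which GPU architectures the binary actually embeds before trying to prune anything. This is just a sketch; the path to libtorch_cuda.so is an example and will differ per install:

```shell
# List the cubin/PTX images embedded in the library to see which
# sm_XX architectures it was compiled for (path is an example).
cuobjdump --list-elf /path/to/libtorch/lib/libtorch_cuda.so
cuobjdump --list-ptx /path/to/libtorch/lib/libtorch_cuda.so

# Generic size reduction that works even on non-relocatable .so files:
# drop symbol/debug information (does not remove kernel code).
strip --strip-unneeded /path/to/libtorch/lib/libtorch_cuda.so
```

`strip` only removes symbols, not the per-architecture kernel images, so the savings are modest compared to actually dropping architectures.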

You could build PyTorch from source for only the architectures you need, so there would be nothing to prune afterwards. The source build should also allow you to statically link other dependencies into PyTorch, but I'm unsure how well tested that approach is.
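A minimal sketch of such a build, assuming a single target GPU generation (the chosen arch `8.6` and the static-link flags are examples; availability of the static options can vary between PyTorch versions):

```shell
# Sketch: build PyTorch/libtorch from source for one architecture only.
git clone --recursive https://github.com/pytorch/pytorch
cd pytorch

# Compile CUDA kernels only for sm_86; this is the main size lever,
# since each extra architecture adds its own copy of every kernel.
export TORCH_CUDA_ARCH_LIST="8.6"

# Optional: statically link some dependencies (assumed flags; check
# the build docs of your PyTorch version before relying on them).
export USE_STATIC_CUDNN=1
export USE_STATIC_NCCL=1

# Skip building tests to save time.
export BUILD_TEST=0

python setup.py install
```

With a single entry in `TORCH_CUDA_ARCH_LIST`, the resulting libtorch_cuda.so contains kernels for just that architecture, which is typically a large reduction compared to the prebuilt binaries that bundle many generations.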