Link-Time Optimization support

I am curious whether there is any way to enable link-time optimization (LTO) when building PyTorch from source.

NVIDIA has introduced LTO as a stable(?) feature as of CUDA 11.2.

Both LLVM (Clang) and GCC also support LTO via the -flto flag.

Though I am no expert, LTO appears to give very good speedups at the cost of longer compile times.

Is LTO currently supported in PyTorch? If so, how do I enable it?

To be more precise, I would like to know how to specify compiler and linker options for PyTorch during the build.

To enable LTO and related optimizations, the compiler must be given the -flto flag at both the compile and link stages. However, the PyTorch build is complicated enough that it is hard for me to tell how to pass flags to a specific stage and how to check that they were actually applied.
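
To make the question concrete, here is a minimal sketch of what I have in mind. It rests on two assumptions: that the build picks up the standard CFLAGS, CXXFLAGS, and LDFLAGS environment variables (CMake reads these on the first configure, but I have not confirmed that PyTorch forwards them to every sub-build), and that a compile_commands.json is available (CMake writes one when CMAKE_EXPORT_COMPILE_COMMANDS is on). The helper names with_lto_flags and entries_missing_flag are hypothetical and not part of PyTorch.

```python
import json
import os
import shlex


def with_lto_flags(env, flag="-flto"):
    """Return a copy of env with `flag` appended to the compile and link flag variables."""
    out = dict(env)
    # CFLAGS/CXXFLAGS cover the compile stage, LDFLAGS covers the link stage.
    for var in ("CFLAGS", "CXXFLAGS", "LDFLAGS"):
        current = out.get(var, "")
        if flag not in current.split():
            out[var] = (current + " " + flag).strip()
    return out


def entries_missing_flag(compile_commands_path, flag="-flto"):
    """List source files whose recorded compile command lacks `flag`."""
    with open(compile_commands_path) as f:
        entries = json.load(f)
    missing = []
    for entry in entries:
        # Each entry carries either an "arguments" list or a "command" string.
        args = entry.get("arguments") or shlex.split(entry.get("command", ""))
        # Accept both the bare flag and variants such as -flto=thin.
        if not any(a == flag or a.startswith(flag + "=") for a in args):
            missing.append(entry["file"])
    return missing


if __name__ == "__main__":
    # Print the flag variables as they would look with -flto added.
    env = with_lto_flags(os.environ)
    for var in ("CFLAGS", "CXXFLAGS", "LDFLAGS"):
        print(f"{var}={env[var]}")
```

Note that compile_commands.json only records compile commands, so this check says nothing about the link stage; for that I would presumably have to inspect verbose build output. It also does not cover nvcc, where device LTO uses a different flag (-dlto).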