How to build PyTorch without CUDA static linkage

I am intercepting some CUDA runtime APIs during training, but certain calls (e.g., cudaLaunchKernel) never reach my interception code. I suspect this is because some components link the CUDA runtime statically, especially kernels launched from third-party libraries. Is there a way to force dynamic linkage when building PyTorch to address this issue?
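
For reference, here is a minimal sketch of the kind of interception shim being described, assuming Linux/glibc, an LD_PRELOAD setup, and a target process that links the *shared* CUDA runtime (the file name `intercept.cpp` and the log format are just illustrative):

```cpp
// intercept.cpp -- LD_PRELOAD shim that logs cudaLaunchKernel calls,
// then forwards them to the real implementation in libcudart.so.
#include <cstdio>
#include <dlfcn.h>          // dlsym, RTLD_NEXT (g++ defines _GNU_SOURCE)
#include <cuda_runtime.h>   // dim3, cudaError_t, cudaStream_t

extern "C" cudaError_t cudaLaunchKernel(const void* func, dim3 gridDim,
                                        dim3 blockDim, void** args,
                                        size_t sharedMem, cudaStream_t stream) {
    using launch_fn = cudaError_t (*)(const void*, dim3, dim3, void**,
                                      size_t, cudaStream_t);
    // Resolve the next definition of the symbol (the real one in libcudart.so).
    static launch_fn real_launch =
        reinterpret_cast<launch_fn>(dlsym(RTLD_NEXT, "cudaLaunchKernel"));
    fprintf(stderr, "[intercept] cudaLaunchKernel grid=(%u,%u,%u) block=(%u,%u,%u)\n",
            gridDim.x, gridDim.y, gridDim.z,
            blockDim.x, blockDim.y, blockDim.z);
    return real_launch(func, gridDim, blockDim, args, sharedMem, stream);
}
```

Built with something like `g++ -shared -fPIC -I/usr/local/cuda/include -o libintercept.so intercept.cpp -ldl` and loaded via `LD_PRELOAD=$PWD/libintercept.so python train.py`. The key limitation is exactly the symptom above: if cudaLaunchKernel was bound at link time inside a statically linked runtime, the dynamic linker is never consulted and the shim is skipped.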

You could try setting CUDA_USE_STATIC_CUDA_RUNTIME=0 and rebuilding PyTorch from source, e.g. `CUDA_USE_STATIC_CUDA_RUNTIME=0 python setup.py develop`. With that flag the build links against the shared CUDA runtime (libcudart.so) instead of the static libcudart_static.a, so runtime calls such as cudaLaunchKernel are resolved through the dynamic linker and become interceptable with LD_PRELOAD. Keep in mind this only changes PyTorch's own linkage: third-party libraries that ship with a statically linked copy of the runtime will still bypass your shim.
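
One way to check whether the rebuild actually picked up the shared runtime is to probe where the symbol resolves from. This is only a sketch, assuming Linux/glibc, a recent PyTorch layout where the CUDA code lives in libtorch_cuda.so, and that torch/lib is on LD_LIBRARY_PATH:

```cpp
// check_cudart.cpp -- report which shared object provides cudaLaunchKernel.
#include <cstdio>
#include <dlfcn.h>  // dlopen, dlsym, dladdr (g++ defines _GNU_SOURCE)

int main() {
    // Load PyTorch's CUDA library; assumes torch/lib is on LD_LIBRARY_PATH.
    void* handle = dlopen("libtorch_cuda.so", RTLD_LAZY | RTLD_GLOBAL);
    if (!handle) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }
    void* sym = dlsym(handle, "cudaLaunchKernel");
    Dl_info info = {};
    if (sym && dladdr(sym, &info) && info.dli_fname) {
        // With the shared runtime this should print a path to libcudart.so;
        // with the static runtime it prints libtorch_cuda.so itself.
        printf("cudaLaunchKernel resolves from: %s\n", info.dli_fname);
    } else {
        printf("cudaLaunchKernel not visible as a dynamic symbol "
               "(consistent with a statically linked runtime)\n");
    }
    return 0;
}
```

Running `ldd` on libtorch_cuda.so is an equivalent quick check: a line mentioning libcudart.so indicates dynamic linkage.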