Customizing PyTorch images for different architectures

Hello guys,

I’ve been working extensively on customizing PyTorch images, in particular enabling and disabling build options such as USE_CUSPARSELT and USE_NCCL. Given the wide range of build flags involved (CUDA, cuDNN, TensorPipe, XNNPACK, MKLDNN, OpenMP, and so on), managing these configurations across different environments (A100, H100, etc.) is becoming increasingly complex.

I’m looking for guidance on how to streamline this process effectively within PyTorch. Specifically, I want to ensure that I can enable/disable these flags seamlessly depending on the target environment while keeping the build process as efficient as possible.

Is there someone I could connect with who specializes in managing PyTorch builds and configurations? I’d like to discuss best practices for automating this process, especially when it comes to handling flags like USE_CUSPARSELT, where certain features are still under development. Any insights on how to manage this complexity effectively would be greatly appreciated.

Thanks in advance for your help!

Could you describe your actual use case and why you want to disable certain features?
While it’s true that some features are experimental, that shouldn’t depend on which device you are using. E.g. if you are not interested in using cuSPARSELt, you can just disable it globally.
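
As a rough sketch of what "globally" could look like in a build script: PyTorch's source build picks up `USE_*` environment variables and forwards them to CMake, so setting `USE_CUSPARSELT=0` before the build should be enough. The `build_env` helper and the checkout path below are made up for illustration, and the exact build command depends on your PyTorch version.

```python
import os


def build_env(overrides, base=None):
    """Return a copy of the environment with the given USE_* flags set to "1" or "0".

    `base` defaults to the current process environment; pass a dict to start
    from something else (hypothetical helper, not part of PyTorch).
    """
    env = dict(os.environ if base is None else base)
    for flag, enabled in overrides.items():
        env[flag] = "1" if enabled else "0"
    return env


# Disable cuSPARSELt for every build, whichever GPU the image targets.
env = build_env({"USE_CUSPARSELT": False}, base={})
print(env)  # {'USE_CUSPARSELT': '0'}

# To run an actual build from a source checkout (path and command are assumptions):
# import subprocess, sys
# subprocess.run([sys.executable, "setup.py", "develop"],
#                cwd="/path/to/pytorch",
#                env=build_env({"USE_CUSPARSELT": False}),
#                check=True)
```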

Thanks for the input! To give you some context, we are juggling different versions of PyTorch, NCCL, Transformer Engine (TFE), and Flash Attention (FA) across our environment. Managing these versions requires us to carefully enable or disable features based on compatibility and performance needs. For example, we recently had to explicitly enable cuSPARSELt (which meant recompiling torch) to test and optimize sparse matrix operations on specific hardware, because we had issues with the nightly build.
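
For what it's worth, one way to check which flags a given binary was built with is to look at the text returned by `torch.__config__.show()`, which usually contains a "Build settings" line with `USE_*` entries. Below is a minimal sketch that pulls those entries out of such a string. The `parse_build_flags` helper and the sample text are made up for illustration, and whether a particular flag (e.g. USE_CUSPARSELT) shows up at all depends on the PyTorch version.

```python
import re

TRUTHY = {"1", "ON", "TRUE", "YES"}


def parse_build_flags(config_text):
    """Extract USE_* settings from text in the style of torch.__config__.show().

    Returns a dict mapping flag name to True/False (hypothetical helper).
    """
    pairs = re.findall(r"\b(USE_[A-Z0-9_]+)=([A-Za-z0-9]+)", config_text)
    return {name: value.upper() in TRUTHY for name, value in pairs}


# Made-up sample, shaped like a "Build settings" line.
sample = (
    "  - Build settings: BUILD_TYPE=Release, USE_CUDA=ON, "
    "USE_CUSPARSELT=1, USE_NCCL=ON, USE_OPENMP=ON, USE_ROCM=OFF, "
)
flags = parse_build_flags(sample)
print(flags)

# With PyTorch installed, the same helper can be pointed at the real output:
# import torch
# print(parse_build_flags(torch.__config__.show()).get("USE_CUSPARSELT"))
```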

Our goal is to ensure that the right set of features is activated for each build, as we often deal with experimental flags and custom builds. This level of control helps us tailor the performance to the workloads we are running, especially on diverse devices like A100s and H100s.

Having a flexible way to manage these flags across different versions would be invaluable for us. Do you have any suggestions for making this process smoother?
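
To make the question a bit more concrete, this is roughly the kind of setup we have in mind: shared defaults plus a small per-target override, resolved into the environment for one build. It is only a sketch. The profile names, the chosen flag values, and the `resolve`/`config_tag` helpers are all hypothetical. The only things taken as given are that `TORCH_CUDA_ARCH_LIST` selects the CUDA architectures to compile for, and that A100 and H100 are compute capability 8.0 and 9.0.

```python
import hashlib
import json

# Defaults shared by every image (values are illustrative, not recommendations).
BASE_FLAGS = {
    "USE_CUDA": "1",
    "USE_CUDNN": "1",
    "USE_NCCL": "1",
    "USE_CUSPARSELT": "0",
}

# Per-target overrides, layered on top of the defaults.
TARGETS = {
    "a100": {"TORCH_CUDA_ARCH_LIST": "8.0"},
    "h100": {"TORCH_CUDA_ARCH_LIST": "9.0", "USE_CUSPARSELT": "1"},
}


def resolve(target, extra=None):
    """Merge defaults, target overrides, and one-off overrides (last one wins)."""
    if target not in TARGETS:
        raise ValueError(f"unknown target: {target!r}")
    return {**BASE_FLAGS, **TARGETS[target], **(extra or {})}


def config_tag(flags):
    """Short, deterministic tag for a flag set, e.g. for naming images or caches."""
    payload = json.dumps(flags, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()[:12]


for name in TARGETS:
    flags = resolve(name)
    print(name, config_tag(flags), flags)
```

Keeping the overrides as plain data means a one-off experiment (say, `resolve("a100", {"USE_CUSPARSELT": "1"})`) gets its own tag without touching the shared defaults.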

Thanks again for your help!