Guidance on Placement of Architecture-Specific Kernel Optimizations in PyTorch Codebase

Hello PyTorch Community,

I’m currently working on improving the performance of specific operations in PyTorch. My focus is on modifying kernel-level code in the aten/src/ATen/native/cpu/TensorCompareKernel.cpp file, particularly the min_kernel_impl and max_kernel_impl functions. My approach involves integrating CPU-specific optimizations.

I would like to ask for guidance on the following points:

Q1. Is it suitable to place CPU-specific, conditional (if-else) optimizations within TensorCompareKernel.cpp for functions like min_kernel_impl and max_kernel_impl?

Q2. If TensorCompareKernel.cpp isn’t the right place for such architecture-specific enhancements, could you suggest an alternative file or module within the PyTorch codebase where these optimizations could be appropriately integrated?

Q3. What general advice or guidelines does the community follow when contributing CPU/architecture-specific code, so that the design philosophy and maintainability of PyTorch are preserved?

Insights, advice, or links to any relevant documentation or discussions would be highly valuable and appreciated.

Thank you for your time and help!

I would recommend creating an issue on GitHub explaining your approach and allowing code owners to review your suggestion. Generally, I would assume architecture-specific code can land, but since it’s CPU-specific I’m not deeply familiar with the structure.