For context: I’d like to help non-tech-savvy users enjoy JIT acceleration in applications like ComfyUI and Unsloth. Recently I published triton-windows wheels to PyPI, which bundle TinyCC and ptxas, so users can ‘just pip install’ them (without manually setting up compiler toolchains, which is a pain on Windows). It also helps downstream developers properly package their applications.
This works as long as torch.compile targets the GPU. However, users still need a C++ compiler if they ‘accidentally’ run torch.compile targeting the CPU. If I understand correctly, this can happen because of a graph break, although I don’t yet have a minimal reproducer for it.
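To make the failure mode concrete, here is a hypothetical sketch of how I imagine a CPU subgraph could slip into a compiled region. This is not a confirmed reproducer; the .cpu() move inside the function is my assumption about one way Inductor could end up emitting C++ code and invoking a host compiler:

```python
import torch

def mixed(x):
    y = x * 2                 # GPU work: handled by Inductor's CUDA backend
    idx = y.argmax().cpu()    # hypothetical: a tensor moved to CPU inside the region
    return idx + 1            # CPU op; Inductor may emit C++ here and need a compiler

# Compilation is lazy: the backend (and any C++ toolchain requirement)
# only kicks in on the first call with real inputs.
compiled = torch.compile(mixed)
```

Eagerly the function runs anywhere; it is only under torch.compile that the CPU portion would (per my assumption) require a C++ toolchain.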
Is there a mechanism to let torch.compile target only the GPU and skip compiling the CPU part?
(By the way… I don’t think there is a C++ compiler toolchain small enough to be bundled in the wheels, yet complete enough for torch.compile targeting the CPU. If you know of one, please tell me.)