Segment Reduce memory problems while building?

I’m trying to build PyTorch from source (cloned 7a79de1) on a fresh Windows 11 system (Ryzen 7 5800XT, 32GB DDR4-2666 RAM) with an NVIDIA RTX 5060 Ti 16GB and CUDA 12.9. Building from source is a critical requirement as existing PyTorch builds don’t yet support this GPU.
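
For reference, the source checkout was roughly the standard flow (the repo URL and the submodule/dependency steps are taken from the usual PyTorch from-source build instructions; the checkout hash is the commit mentioned above):

rem clone PyTorch and check out the commit being built
git clone https://github.com/pytorch/pytorch
cd pytorch
git checkout 7a79de1
git submodule update --init --recursive
rem install the Python build dependencies into the virtual environment
pip install -r requirements.txt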

My build consistently fails with an “LLVM ERROR: out of memory” and an nvcc error ('""%CICC_PATH%\cicc"' died with status 0xC0000409) while compiling aten\src\ATen\native\cuda\SegmentReduce.cu. The log shows the failure happens specifically during object file generation for this .cu file.

I’ve already taken several troubleshooting steps:

  • Increased Windows Virtual Memory (Page File): Set to Initial: 49152 MB (48GB) & Maximum: 65536 MB (64GB). PC was restarted.
  • Limited Build Jobs: Ran set MAX_JOBS=1 before python setup.py install (the full per-attempt command sequence is sketched below this list).
  • Clean Build: Deleted the build directory before each attempt.
  • Environment: Using Visual Studio 2022 Community (MSVC 19.44.35208.0) Developer Command Prompt, Python 3.10.6 in a virtual environment (sd_env), and all Python build dependencies installed.
  • System Checks: Windows Memory Diagnostic reported no errors. Event Viewer showed no relevant errors during previous failures.
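
Concretely, each build attempt looks roughly like this, run from the VS 2022 Developer Command Prompt with the sd_env virtual environment activated (the clean-build step is simply deleting the build directory):

rem remove the previous build output
rmdir /s /q build
rem restrict the build to a single compile job
set MAX_JOBS=1
rem build and install into the active virtual environment
python setup.py install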

The build log (partial paste below) clearly shows the failure at SegmentReduce.cu. It seems the compiler (or related tools) is hitting a memory limit even with substantial virtual memory and single-threaded compilation.

[3400/7521] Building CUDA object caffe2\CMakeFiles\torch_cuda.dir\__\aten\src\ATen\native\cuda\SegmentReduce.cu.obj
FAILED: caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/SegmentReduce.cu.obj
C:\PROGRA~1\NVIDIA~2\CUDA\v12.9\bin\nvcc.exe -forward-unknown-to-host-compiler ... -x cu -c C:\StableDiffusion\pytorch\aten\src\ATen\native\cuda\SegmentReduce.cu -o caffe2\CMakeFiles\torch_cuda.dir\__\aten\src\ATen\native\cuda\SegmentReduce.cu.obj -Xcompiler=-Fdcaffe2\CMakeFiles\torch_cuda.dir\,-FS
LLVM ERROR: out of memory
SegmentReduce.cu
nvcc error : '""%CICC_PATH%\cicc"' died with status 0xC0000409
ninja: build stopped: subcommand failed.

Could this be a compatibility issue between CUDA 12.9 and the RTX 5060 Ti’s specific architecture during this particular compilation step? Or is there another compiler flag or environment variable that might help with memory management in nvcc’s LLVM backend step?

Our nightly binaries have supported Blackwell architectures for a few months now, as does the latest stable PyTorch 2.7.0 release, so just install one of these binaries if you are running into issues building from source.
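
For example, something along these lines (the index URLs are the ones from the PyTorch install selector for the CUDA 12.8 builds; adjust the suffix if you pick a different CUDA version there):

rem stable 2.7.0 wheel built against CUDA 12.8 (includes Blackwell support)
pip3 install torch --index-url https://download.pytorch.org/whl/cu128
rem or a nightly wheel
pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128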

Hi @NateTalley, any luck finding a workaround for this issue? It looks like we are hitting it in PyTorch CD now as well (see the linked GitHub issue). We are using CPU machines to build the CUDA 12.9 PyTorch wheels.