NaNs with torch.compile (inductor, mode="reduce-overhead"), OOM with backend="cudagraphs"

Hi,

I’m running into two related issues when trying to speed up a very large model with torch.compile and CUDA Graphs:

  1. NaNs when using backend="inductor", mode="reduce-overhead"

  2. OOM when using backend="cudagraphs" to avoid fusion

Because of company policy and the scale of the system, I cannot share the full model code or a minimal repro script (details at the end), so I’m mainly looking for guidance / best practices and to understand whether what I’m seeing matches known limitations or bugs.


1. Setup

  • Model: very large, deeply stacked model (thousands of small custom modules); the model code is on the order of 10,000+ lines.

  • Training setup: large-scale distributed training (needs dozens of machines / GPUs to reproduce the issue; on a single machine the problem does NOT show up).

  • Precision: mixed precision (FP16)

  • PyTorch version: 2.4.0

  • CUDA version: 12.1

  • GPUs: NVIDIA L20, 8 ranks per node, 40+ nodes

  • OS: Linux (Ubuntu 20.04)

Compile calls:

# Baseline (works)  
model = model.cuda()  
model = torch.compile(model, backend="inductor", mode="default")  

# Problematic: NaNs  
model = model.cuda()  
model = torch.compile(model, backend="inductor", mode="reduce-overhead")  

# Alternative backend: OOM  
model = model.cuda()  
model = torch.compile(model, backend="cudagraphs", mode="default")  


Behavior with different compile settings

On our large distributed setup, I observe the following:

  • Eager (no torch.compile): no NaNs, runs fine.

  • torch.compile(..., backend="inductor", mode="default"): no NaNs, runs fine.

  • torch.compile(..., backend="inductor", mode="reduce-overhead"):
    NaNs appear in the outputs / loss after some training steps.

  • torch.compile(..., backend="cudagraphs", mode="default"):
    No NaNs, but CUDA OOM due to CUDA Graph private pools.

On a single machine / small scale, I have not been able to reproduce the NaNs;
they only appear at large scale (many machines / GPUs).

NaNs with inductor + mode="reduce-overhead"

With

model = torch.compile(model, backend="inductor", mode="reduce-overhead")

the model eventually produces NaNs (in activations / loss). This does not
happen in eager mode or with mode="default".

Debugging is very difficult because:

  • The model is extremely deep and modular; the first NaN can appear far away
    from the final loss.

  • If I add print, assert, or other Python-side debug checks inside the
    model to try to detect where NaNs first appear, these debug statements often
    cause graph breaks, which change the compiled graph. After that, the
    NaNs usually disappear and the run becomes numerically stable again.

  • Because the issue only reproduces on a large-scale distributed run, and any
    intrusive debugging tends to break the graph and “fix” the issue, I have not
    been able to extract a small, self-contained repro.

So effectively, on the real system I see:

  • eager: OK

  • backend="inductor", mode="default": OK

  • backend="inductor", mode="reduce-overhead": NaNs

  • adding debug prints / asserts: often introduces graph breaks and the NaNs go away
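
One workaround I'm experimenting with is checking tensors only outside the compiled region, where plain eager Python cannot cause graph breaks inside the model. A minimal sketch (the small Linear model here is just a stand-in for the real compiled model):

```python
import torch

def check_finite(name: str, t: torch.Tensor) -> None:
    # Runs in eager code, after the compiled forward has returned,
    # so it cannot introduce a graph break inside the model.
    if not torch.isfinite(t).all():
        raise RuntimeError(f"non-finite values detected in {name}")

# Stand-ins for the real (compiled) model and inputs.
model = torch.nn.Linear(8, 8)
x = torch.randn(4, 8)

out = model(x)                     # in the real run: compiled_model(x)
check_finite("model output", out)  # fails on the step where NaNs first appear
loss = out.float().pow(2).mean()
check_finite("loss", loss)
```

This only localizes the first bad *step*, not the first bad *layer*, but at least it gives a deterministic trigger point without perturbing the compiled graph.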

Questions (NaN / reduce-overhead side)

  1. Are there known numerical issues or limitations specific to
    mode="reduce-overhead" in the inductor backend, especially for very large
    models or large-scale distributed runs?

  2. Is there any recommended way to debug NaNs under mode="reduce-overhead"
    without causing graph breaks? For example:

    • built-in instrumentation / logging options,

    • debug flags for inductor / dynamo,

    • ways to selectively disable certain fusions or transformations that are
      more likely to be numerically risky.

  3. Are there configuration flags I can try (env vars or torch._dynamo /
    torch._inductor settings) to make inductor more conservative in
    reduce-overhead mode and potentially avoid the NaNs?

OOM with backend="cudagraphs"

To avoid potential fusion-related numerical issues, I tried using:

model = torch.compile(model, backend="cudagraphs", mode="default")

My understanding is that this backend keeps the original eager kernels and
wraps them in CUDA Graphs, i.e. “CUDA Graphs only, no extra fusion”.

However, with this backend I quickly hit CUDA OOM on the same large-scale
setup (with the same model and batch size that works under backend="inductor").

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.38 GiB.
GPU 0 has a total capacity of 44.53 GiB of which 403.94 MiB is free.
Process XXXXX has 44.08 GiB memory in use.
Of the allocated memory 30.55 GiB is allocated by PyTorch, with 12.89 GiB
allocated in private pools (e.g., CUDA Graphs), and 12.99 GiB is reserved
by PyTorch but unallocated.
If reserved but unallocated memory is large try setting
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
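
Following the error message's own hint, the allocator config has to be set before the first CUDA allocation; a minimal sketch of what I plan to try (I haven't verified that it actually shrinks the private-pool overhead, and my understanding is that expandable segments may not apply to CUDA Graph private pools at all, only to the regular allocations):

```python
import os

# Must be set before the first CUDA allocation -- easiest is before
# importing torch at all.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # noqa: E402  (deliberately imported after setting the env var)
```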

Questions (OOM / cudagraphs side)

Is there a recommended “middle ground” configuration to:

  • use CUDA Graphs with some limited fusion (but avoid more aggressive or
    numerically risky optimizations), or

  • run inductor in a more conservative configuration that avoids both NaNs
    and excessive memory usage?
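
As far as I can tell, mode="reduce-overhead" is roughly default-mode inductor plus CUDA Graphs enabled, so the two "middle ground" variants I have in mind would look like this via the options dict (if my understanding of the mode-to-options mapping is wrong, please correct me):

```python
import torch

model = torch.nn.Linear(8, 8)  # stand-in for the real model

# Variant A: inductor fusions, but no CUDA Graphs
# (keeps mode="default" numerics and memory behavior, gives up the
# graph-replay overhead win).
model_a = torch.compile(model, backend="inductor",
                        options={"triton.cudagraphs": False})

# Variant B: default-mode inductor with CUDA Graphs layered on top,
# stated explicitly through options instead of mode="reduce-overhead".
model_b = torch.compile(model, backend="inductor",
                        options={"triton.cudagraphs": True})
```

Compilation is lazy, so nothing is actually compiled until the first call; the question is whether Variant A is a sensible way to keep inductor's memory behavior while sidestepping whatever reduce-overhead changes.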