Hi,
I’m running into two related issues when trying to speed up a very large model with torch.compile and CUDA Graphs:
- NaNs when using `backend="inductor", mode="reduce-overhead"`
- OOM when using `backend="cudagraphs"` to avoid fusion
Because of company policy and the scale of the system, I cannot share the full model code or a minimal repro script (details at the end), so I’m mainly looking for guidance / best practices and to understand whether what I’m seeing matches known limitations or bugs.
1. Setup
- Model: very large, deeply stacked model (thousands of small custom modules); total model code is O(10,000+) lines.
- Training setup: large-scale distributed training (needs dozens of machines / GPUs to reproduce the issue; on a single machine the problem does NOT show up).
- Precision: mixed precision (FP16)
- PyTorch version: 2.4.0
- CUDA version: 12.1
- GPUs: 8 ranks per node (NVIDIA L20), 40+ nodes
- OS: Linux, Ubuntu 20.04
Compile calls:

```python
# Baseline (works)
model = model.cuda()
model = torch.compile(model, backend="inductor", mode="default")

# Problematic: NaNs
model = model.cuda()
model = torch.compile(model, backend="inductor", mode="reduce-overhead")

# Alternative backend: OOM
model = model.cuda()
model = torch.compile(model, backend="cudagraphs", mode="default")
```
2. Behavior with different compile settings
On our large distributed setup, I observe the following:
- Eager (no `torch.compile`): no NaNs, runs fine.
- `torch.compile(..., backend="inductor", mode="default")`: no NaNs, runs fine.
- `torch.compile(..., backend="inductor", mode="reduce-overhead")`: NaNs appear in the outputs / loss after some training steps.
- `torch.compile(..., backend="cudagraphs", mode="default")`: no NaNs, but CUDA OOM due to CUDA Graph private pools.
On a single machine / small scale, I have not been able to reproduce the NaNs;
they only appear at large scale (many machines / GPUs).
3. NaNs with inductor + mode="reduce-overhead"
With `model = torch.compile(model, backend="inductor", mode="reduce-overhead")`,
the model eventually produces NaNs (in activations / loss). This does not
happen in eager mode or with `mode="default"`.
Debugging is very difficult because:
- The model is extremely deep and modular; the first NaN can appear far away from the final loss.
- If I add `print`, `assert`, or other Python-side debug checks inside the model to try to detect where NaNs first appear, these debug statements often cause graph breaks, which change the compiled graph. After that, the NaNs usually disappear and the run becomes numerically stable again.
- Because the issue only reproduces on a large-scale distributed run, and any intrusive debugging tends to break the graph and "fix" the issue, I have not been able to extract a small, self-contained repro.
So effectively, on the real system I see:
- eager: OK
- `backend="inductor", mode="default"`: OK
- `backend="inductor", mode="reduce-overhead"`: NaNs
- adding debug prints / asserts: often introduces graph breaks, and the NaNs go away
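One workaround I can think of (a sketch, with a helper name of my own invention): instead of putting checks inside the model, scan parameters and gradients from the outer training loop, after the compiled forward/backward has finished. Dynamo never traces the loop body outside the compiled call, so this cannot introduce graph breaks:

```python
import torch

def first_nonfinite_param(model: torch.nn.Module):
    """Return the name of the first parameter (or its gradient) that
    contains NaN/Inf, or None if everything is finite. Intended to be
    called from the training loop, outside the compiled region, so it
    cannot cause graph breaks."""
    for name, p in model.named_parameters():
        if not torch.isfinite(p).all():
            return name
        if p.grad is not None and not torch.isfinite(p.grad).all():
            return name + ".grad"
    return None
```

Calling this every N steps (together with a `torch.isfinite(loss)` check on the loss itself) at least narrows down *when* the NaNs first appear and which block they surface in, without perturbing the compiled graph.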
4. Questions (NaN / reduce-overhead side)
- Are there known numerical issues or limitations specific to `mode="reduce-overhead"` in the inductor backend, especially for very large models or large-scale distributed runs?
- Is there any recommended way to debug NaNs under `mode="reduce-overhead"` without causing graph breaks? For example:
  - built-in instrumentation / logging options,
  - debug flags for inductor / dynamo,
  - ways to selectively disable certain fusions or transformations that are more likely to be numerically risky.
- Are there configuration flags I can try (env vars or `torch._dynamo` / `torch._inductor` settings) to make inductor more conservative in `reduce-overhead` mode and potentially avoid the NaNs?
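On the last question, one configuration I would try first (a sketch only; the option names come from `torch._inductor.config` as of roughly this PyTorch version, so please verify them against your installed build): keep `reduce-overhead` (and thus CUDA Graphs), but turn off some of the more aggressive inductor behaviors via the `options` dict that `torch.compile` accepts:

```python
# Sketch: reduce-overhead with a more conservative inductor config.
# Option keys map onto torch._inductor.config attributes; check they
# exist in your version before relying on them.
model = torch.compile(
    model,
    backend="inductor",
    mode="reduce-overhead",       # keep CUDA Graphs
    options={
        "epilogue_fusion": False,  # skip epilogue fusions
        "max_autotune": False,     # no autotuned kernel selection
        "fallback_random": True,   # eager-compatible RNG kernels
    },
)
```

If the NaNs disappear with one of these flipped, that would point at the corresponding transformation; `TORCH_LOGS="+dynamo,+inductor"` can additionally show what inductor is generating.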
5. OOM with backend="cudagraphs"
To avoid potential fusion-related numerical issues, I tried using:
```python
model = torch.compile(model, backend="cudagraphs", mode="default")
```
My understanding is that this backend keeps the original eager kernels and
wraps them in CUDA Graphs, i.e. “CUDA Graphs only, no extra fusion”.
However, with this backend I quickly hit CUDA OOM on the same large-scale
setup (with the same model and batch size that works under backend="inductor").
```text
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.38 GiB.
GPU 0 has a total capacity of 44.53 GiB of which 403.94 MiB is free.
Process XXXXX has 44.08 GiB memory in use.
Of the allocated memory 30.55 GiB is allocated by PyTorch, with 12.89 GiB
allocated in private pools (e.g., CUDA Graphs), and 12.99 GiB is reserved
by PyTorch but unallocated.
If reserved but unallocated memory is large try setting
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```
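I have not yet tried the allocator's own suggestion; for completeness, it would look like this (set in the environment of every rank before CUDA is initialized — and note that in some PyTorch versions expandable segments are not compatible with CUDA Graph capture, so this needs to be verified on the version in use):

```shell
# Allow the caching allocator to grow segments instead of reserving
# fixed-size blocks, so reserved-but-unallocated memory can be reused.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```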
Is there a recommended "middle ground" configuration to:

- use CUDA Graphs with some limited fusion (but avoid more aggressive or numerically risky optimizations), or
- run inductor in a more conservative configuration that avoids both NaNs and excessive memory usage?
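One middle ground I am considering (a sketch, untested at this scale; `model.block` and `sample_input` are placeholders for one of our submodules and a representative static-shape GPU input): compile the whole model with inductor's default mode, which is numerically fine in our runs, and apply CUDA Graphs only to selected submodules via `torch.cuda.make_graphed_callables`, which replays the existing eager kernels:

```python
# Graph only a trusted, static-shape submodule; leave the rest of the
# model to inductor's default mode (no reduce-overhead CUDA Graphs).
model.block = torch.cuda.make_graphed_callables(model.block, (sample_input,))
model = torch.compile(model, backend="inductor", mode="default")
```

Would this be a sensible way to bound the CUDA Graph private-pool memory, or is there a better-supported knob for partial graphing?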