[Solved] RTX 5090 (sm_120) Training Segfault - DDP Was the Cause

Hi everyone,

I spent a few weeks debugging training crashes on my RTX 5090 and finally figured it out — sharing in case it helps anyone with the same GPU.

Setup

GPU: RTX 5090 Laptop (GB203, sm_120, 24GB VRAM)
OS: Windows 11
Framework: Style-Bert-VITS2 (VITS-based TTS model, DDP enabled by default)

Symptom

Inference worked perfectly, but training would segfault (0xC0000005 / ACCESS_VIOLATION) within seconds — no Python traceback, just a silent OS-level kill. I narrowed it down with checkpoint logging (print(flush=True) plus file logging, since a segfault discards anything still sitting in stdout buffers):

CP7j: about to loss_slm.backward()    ✅
CP7k: backward DONE                   ❌ (never reached)
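
The logging pattern matters here: a segfault kills the process before buffered output is flushed, so each marker has to reach the terminal and the disk immediately. A minimal sketch of this pattern (the helper name and log path are illustrative, not from the Style-Bert-VITS2 code):

```python
import os

def checkpoint(tag: str, log_path: str = "train_debug.log") -> None:
    """Write a progress marker that survives a hard crash."""
    # flush=True pushes the line past Python's stdout buffering
    print(tag, flush=True)
    # Append to a file and fsync so the OS commits it to disk now;
    # otherwise the last markers vanish when the process is killed.
    with open(log_path, "a") as f:
        f.write(tag + "\n")
        f.flush()
        os.fsync(f.fileno())

checkpoint("CP7j: about to loss_slm.backward()")
# loss_slm.backward() would run here
checkpoint("CP7k: backward DONE")
```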

The crash happened during backward() — specifically in gradient synchronization. Forward pass for all models (Generator, Discriminator, WavLM) completed fine every time.

I tried everything I could think of: disabling GradScaler, pure fp32 mode, CUDA_LAUNCH_BLOCKING=1, disabling WavLM loss — nothing helped. Even with WavLM disabled, the regular Discriminator backward also segfaulted at the same stage.

What I Tried — Two Paths

Path 1: PyTorch stable (2.4.1+cu121)

sm_120 native kernels don’t exist in stable, so PyTorch falls back to PTX JIT compilation. This actually works for basic operations — you get a UserWarning but execution continues:

UserWarning: NVIDIA GeForce RTX 5090 Laptop GPU with CUDA capability sm_120
is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_50 sm_60 ... sm_90.

However, DDP’s broadcast_coalesced immediately fails:

RuntimeError: CUDA error: no kernel image is available for execution on the device

After patching DDP out (conditional skip when world_size == 1), I hit a cascade of no kernel image errors on progressively more basic operations:

1. torch.nn.functional.embedding (emb_g)  → "no kernel image" ❌
2. torch.nn.functional.conv1d (bert_proj)  → "no kernel image" ❌

Each one could be individually patched with CPU fallback, but the cascade kept going — PTX JIT fallback doesn’t cover enough ops for stable training. I gave up on this path.
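Each per-op patch followed the same shape: catch the "no kernel image" RuntimeError, move the inputs to CPU, and retry there. A generic sketch of that pattern (not the actual patch; in practice you would also move the result back to the original device):

```python
import functools

def cpu_fallback(fn):
    """Retry an op on CPU when the GPU build has no kernel for it."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except RuntimeError as e:
            if "no kernel image" not in str(e):
                raise
            # Duck-typed: move anything with a .cpu() method to CPU and retry.
            cpu_args = [a.cpu() if hasattr(a, "cpu") else a for a in args]
            cpu_kwargs = {k: v.cpu() if hasattr(v, "cpu") else v
                          for k, v in kwargs.items()}
            return fn(*cpu_args, **cpu_kwargs)
    return wrapper
```

In my case each such patch fixed one op (embedding, then conv1d), and the next op down the line failed the same way, which is why I abandoned this path.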

Path 2: PyTorch nightly (2.12.0.dev+cu128)

Nightly has sm_120 native kernels, so forward pass works without any warnings or fallbacks. But training segfaults during backward — and here’s where it gets interesting.

First attempt — just setting CUDA_VISIBLE_DEVICES=0 (no code changes):

set CUDA_VISIBLE_DEVICES=0
python train_ms_jp_extra.py ...
→ Same segfault. The training script still calls DDP(model) unconditionally.
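
In hindsight this makes sense: CUDA_VISIBLE_DEVICES only controls which devices are visible, while the DDP() call is a code-level decision. A launcher-aware script could gate it on the WORLD_SIZE environment variable that torchrun exports; Style-Bert-VITS2 does not, which is why the environment variable alone changed nothing. A minimal sketch of such a gate (function name is my own):

```python
import os

def should_use_ddp() -> bool:
    """Wrap models in DDP only when launched with multiple processes.

    torchrun exports WORLD_SIZE; a plain `python train.py` launch
    leaves it unset, so default to a world size of 1.
    """
    return int(os.environ.get("WORLD_SIZE", "1")) > 1
```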

Second attempt — patching the actual DDP wrapping out of the code:

→ Training runs stable ✅  (1000+ steps, no crash)

The environment variable alone wasn’t enough — the training script still wraps models with DDP() regardless of how many GPUs are visible. The fix required changing the source code to skip DDP() wrapping when world_size == 1.

Interesting finding: On stable, the absence of native kernels triggers PTX JIT fallback, which at least gives you a warning and tries to continue. On nightly, native kernels exist but the DDP/NCCL communication path is broken for sm_120 — and since native kernels exist, PyTorch uses them instead of falling back to PTX, leading to a harder crash (segfault with no Python traceback).

The Fix (2 patches to the training script)

Patch 1: Skip DDP wrapping on single GPU

# Before (applied to ALL DDP-wrapped models):
net_g = DDP(net_g, device_ids=[rank])
net_d = DDP(net_d, device_ids=[rank])
net_dur_disc = DDP(net_dur_disc, device_ids=[rank])
net_wd = DDP(net_wd, device_ids=[rank])

# After:
if torch.distributed.get_world_size() > 1:
    net_g = DDP(net_g, device_ids=[rank])
    net_d = DDP(net_d, device_ids=[rank])
    net_dur_disc = DDP(net_dur_disc, device_ids=[rank])
    net_wd = DDP(net_wd, device_ids=[rank])

Patch 2: Remove .module access

Without DDP there is no DistributedDataParallel container, so the .module attribute doesn't exist. Every .module access must become a direct attribute access:

# Before:
net_g.module.use_noise_scaled_mas
net_g.module.mas_noise_scale_initial
net_g.module.noise_scale_delta
net_g.module.current_mas_noise_scale
generator.module.infer(...)

# After:
net_g.use_noise_scaled_mas
net_g.mas_noise_scale_initial
net_g.noise_scale_delta
net_g.current_mas_noise_scale
generator.infer(...)
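
If you'd rather not hunt down every call site, a small helper makes the same code work with and without the DDP wrapper (the helper name is my own, not from the original script):

```python
def unwrap(model):
    """Return the underlying module whether or not DDP wrapped it."""
    # DistributedDataParallel exposes the real model as .module;
    # a bare nn.Module has no such attribute, so fall back to the model.
    return getattr(model, "module", model)
```

Then e.g. `unwrap(net_g).use_noise_scaled_mas` works in both the DDP and non-DDP cases.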

Important: You also need to set CUDA_VISIBLE_DEVICES=0 when launching training, to ensure the distributed backend initializes with a single process.

That’s it. Two patches + one environment variable.

Results

Batch size 8: 1.10–1.47 it/s
VRAM: 18.5GB / 24GB
Stable over thousands of steps — verified across multiple epochs
WavLM discriminator loss: works normally (was NOT the cause)

I initially suspected WavLM backward was the problem (it was the first backward to crash), but after removing DDP, WavLM backward runs fine. The segfault was always DDP’s gradient synchronization, not any specific model component.

Why This Only Affects Certain Frameworks

Other frameworks I tested on the same GPU worked fine from the start:

Framework                  DDP                                    Result
GPT-SoVITS v2/v4/v2Pro+    Not used                               ✅ Training works
Applio RVC                 Not used                               ✅ 200 epochs completed
kohya_ss (SD training)     Accelerate auto-skips on single GPU    ✅ Training works
Style-Bert-VITS2           Enabled by default                     ❌ Segfault

The pattern is clear: frameworks that don’t use DDP (or auto-skip it on single GPU) work fine on sm_120. DDP + sm_120 is the specific combination that crashes.

Environment

PyTorch: 2.12.0.dev (nightly) + cu128
Launch: set CUDA_VISIBLE_DEVICES=0
Patches: DDP conditional skip + .module removal

Hope this saves someone some debugging time. If anyone has more info on DDP behavior on Blackwell GPUs, I’d love to hear about it.


Blackwell GPUs require CUDA 12.8+. All of our stable releases starting with PyTorch 2.7.0 use CUDA 12.8 or newer and already support Blackwell. The tagged version won't work since it's built with CUDA 12.1. Even if you depended on CUDA JIT compilation from PTX, it would be unnecessary: we have shipped binaries with Blackwell support for almost a year now (since Blackwell launched).

This seems like a code issue, but even then it should not cause problems when a single GPU is installed, as the DDP wrapper is a no-op in that case.

How many GPUs do you have and which architectures?

Hi ptrblck, thanks for the reply. I apologize for the framing — I didn’t know single-GPU DDP was supposed to be a no-op.
Going forward, if I encounter a segfault in training, I’ll check for unconditional DDP() wrapping first and remove it when world_size == 1. That resolved it for me.
Thanks again for the guidance.

Removing unnecessary code sounds valid, but it’s still unexpected to see a segfault. Do you have multiple GPUs installed and if so which architectures?

Just one GPU — it’s a laptop (Intel Core Ultra 9 275HX + RTX 5090 Laptop).

dGPU: RTX 5090 Laptop (GB203, sm_120, 24GB) — only CUDA device
iGPU: Intel Graphics (not CUDA-capable)
nvidia-smi: 1 device
CUDA_VISIBLE_DEVICES=0 set during all tests
torch.cuda.device_count() = 1

I also captured nvidia-smi at 100ms intervals during both tests. Same environment (nightly 2.12.0.dev+cu128), only difference is DDP wrapping:

DDP ON (crash):

21:41:30   Model loading              1597 MHz    11W     VRAM: 0 → 1686 MiB
21:41:34   Clock drop → idle gap      22 MHz      8W      VRAM: 1706 MiB
21:41:45   DDP/NCCL phase begins      232 MHz     41W     VRAM: 1948 MiB
21:41:47   Memory climbing            232 MHz     105W    VRAM: 4000 → 8730 MiB
21:41:57   Peak                       232 MHz     119W    VRAM: 12930 MiB
21:41:59   CRASH (segfault)           22 MHz      8W      VRAM: → 0 MiB

DDP OFF (normal):

12:22:27   Model loading              1597 MHz    26W     VRAM: 0 → 1228 MiB
12:22:40   Model→GPU transfer         232 MHz     108W    VRAM: 1400 → 8730 MiB
12:22:49   Training begins            232 MHz     173W    VRAM: → 24042 MiB
           Epoch 1 completes normally

compute-sanitizer --tool memcheck: ERROR SUMMARY: 0 errors, yet the process still failed to terminate successfully (reproduced in 3/3 crash runs)
