Segmentation fault when calling .backward() after moving data to GPU (PyTorch + CUDA 12.1)

Hi everyone,
I’m running into a segmentation fault (core dumped) error while training a model using PyTorch on a CUDA-enabled GPU.
I’m not sure what’s going wrong, and would really appreciate any guidance.

My Environment
GPU: 2× NVIDIA GeForce RTX 4060 Ti
Driver Version: 550.120
CUDA Version (Driver-side): 12.4
cuDNN Version: 8902
PyTorch Version: 2.2.0+cu121
Python: 3.10.12
CUDA available: True
Detected CUDA from PyTorch: 12.1
Host OS: Ubuntu 24.04
Docker Image: nvidia/cuda:12.4.1-runtime-ubuntu22.04
Kernel: Linux 6.8.0-55-generic x86_64 with glibc 2.35
Running inside Docker: Yes

The Problem
During training, the script suddenly crashes with a segmentation fault.
The crash does not happen at a specific line every time — sometimes it happens in .backward(), sometimes while creating a tensor on GPU using .to(device).
It usually occurs after a few training batches, not at the very beginning.

Here’s a simplified version of the code:

def train_test_ht_sl(model, train_data, test_data, head_list, tail_list):
    import datetime
    model.scheduler.step()
    print('start training: ', datetime.datetime.now())
    model.train()
    total_loss = 0.0
    slices = train_data.generate_batch(model.batch_size)

    for i, j in zip(slices, np.arange(len(slices))):
        model.optimizer.zero_grad()
        targets, scores = forward(model, i, train_data)

        # targets = torch.from_numpy(np.array(targets)).long().to('cuda:1')
        targets = torch.tensor(targets).long().to(device)
        loss = model.loss_function(scores, targets - 1)
        loss.backward()  # <- crash sometimes happens here

def forward(model, i, data):
    alias_inputs, A, items, mask, targets = data.get_slice(i)

    alias_inputs = torch.tensor(alias_inputs, dtype=torch.long, device=device)
    items = torch.tensor(items, dtype=torch.long, device=device)
    mask = torch.tensor(mask, dtype=torch.long, device=device)

    A_np = np.stack(A)
    A = torch.tensor(A_np, dtype=torch.float, device=device)  # <- or here

    hidden = model(items, A)

    get = lambda i: hidden[i][alias_inputs[i]]
    seq_hidden = torch.stack([get(i) for i in torch.arange(len(alias_inputs)).long()])

    return targets, model.compute_scores(seq_hidden, mask)

Error excerpt

Training Progress:  20%|██        | 6/30 [16:14<1:04:58, 162.43s/it]
Fatal Python error: Segmentation fault

Current thread 0x00007... (most recent call first):
  <no Python frame>

Thread 0x00007...:
  File "/usr/lib/python3.10/threading.py", line 324 in wait
  ...
  File "/usr/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266 in backward
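
The "Fatal Python error: Segmentation fault" header and the per-thread stacks in the excerpt above are the format printed by Python's built-in `faulthandler` module. A minimal sketch of enabling it explicitly and sending the dump to a file, so the trace is kept even if the container's console output is lost. The helper name `enable_crash_traces` and the log path are illustrative and not part of the original script.

```python
import faulthandler
import sys


def enable_crash_traces(log_path=None):
    """Dump the Python stacks of all threads on SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL.

    Returns the open log file (keep a reference to it for the lifetime of
    the process), or None when the dump goes to stderr.
    """
    if log_path is None:
        faulthandler.enable(file=sys.stderr, all_threads=True)
        return None
    # faulthandler writes to the raw file descriptor, so the file object
    # must stay open for as long as the handler is installed.
    log_file = open(log_path, "a")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


if __name__ == "__main__":
    # Hypothetical path; call this once, before any CUDA work starts.
    crash_log = enable_crash_traces("segfault_trace.log")
    print("faulthandler enabled:", faulthandler.is_enabled())
```

The same handler can be switched on without code changes via `python -X faulthandler script.py` or by setting `PYTHONFAULTHANDLER=1`. A `<no Python frame>` entry means the crashing thread was not executing Python code at that moment, so the Python-level trace alone cannot pinpoint the faulting call.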

My Question
I’m still new to CUDA programming and PyTorch internals, so I’m not sure:
Why might this segmentation fault occur?
Am I doing something wrong when moving data to the GPU?
Is there a safer or more proper way to handle tensors before calling .backward()?
Any help or explanation would be really appreciated. Thank you in advance!
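
On the question of handling tensors more safely, one thing that can be ruled out cheaply is out-of-range indices, since both the embedding lookup and the loss assume indices within bounds. The sketch below checks a batch on the CPU with NumPy before anything is copied to the GPU. The helper name `check_batch_indices` and the argument `n_node` (the embedding table size) are assumptions for illustration, and the target check assumes 1-based targets scored against `n_node - 1` classes, in line with the `targets - 1` in the loop above. Passing this check does not by itself explain a segmentation fault; it only removes one candidate cause.

```python
import numpy as np


def check_batch_indices(items, alias_inputs, targets, n_node):
    """Raise ValueError if any index in the batch is out of range.

    Assumptions (not confirmed by the original code):
      * items index an embedding table with n_node rows,
      * alias_inputs index positions along axis 1 of items,
      * targets are 1-based, so targets - 1 must lie in [0, n_node - 2].
    """
    items = np.asarray(items)
    alias_inputs = np.asarray(alias_inputs)
    targets = np.asarray(targets)

    if items.min() < 0 or items.max() >= n_node:
        raise ValueError(f"item id out of range [0, {n_node - 1}]")
    if alias_inputs.min() < 0 or alias_inputs.max() >= items.shape[1]:
        raise ValueError("alias index points past the end of the item list")
    if targets.min() < 1 or targets.max() > n_node - 1:
        raise ValueError(f"target out of range [1, {n_node - 1}]")


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_node, batch, seq_len = 1000, 4, 10  # illustrative sizes
    items = rng.integers(1, n_node, size=(batch, seq_len))
    alias = np.tile(np.arange(seq_len), (batch, 1))
    targets = rng.integers(1, n_node, size=(batch,))
    check_batch_indices(items, alias, targets, n_node)  # returns silently when valid
    print("batch indices are within range")
```

Running the check on the NumPy arrays, before `torch.tensor(..., device=device)`, keeps any failure as an ordinary Python exception with a readable message.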

Are you seeing the same issue using the latest stable or nightly release?

I’m currently using the latest stable release: PyTorch 2.2.0 with CUDA 12.1. I haven’t tried the nightly version yet, but I’ll test it and let you know if the issue still occurs there as well.

Thanks again!

The latest stable release is PyTorch 2.6.0 with CUDA 12.4. PyTorch 2.7.0 with CUDA 12.6 is currently in its final verification steps, and the nightly binaries are pre-2.8.0 builds with CUDA 12.8.
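
To compare an installed build against a given release, the version string can be split into its release number and CUDA tag. A minimal sketch, assuming the wheel version follows the `<release>+cu<digits>` pattern seen in `2.2.0+cu121` above, with the last digit of the tag taken as the CUDA minor version. The helper `parse_torch_version` is illustrative, the string `2.6.0+cu124` is constructed from that assumed pattern, and in a real session the input would come from `torch.__version__`.

```python
import re


def parse_torch_version(version):
    """Split a string like '2.2.0+cu121' into ((2, 2, 0), '12.1').

    The CUDA part is None for builds without a '+cuXYZ' local tag.
    Assumes the last digit of the tag is the CUDA minor version.
    """
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)[^+]*(?:\+(.+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    release = tuple(int(part) for part in match.group(1, 2, 3))
    local = match.group(4) or ""
    cuda = None
    tag = re.fullmatch(r"cu(\d+)(\d)", local)
    if tag:
        cuda = f"{tag.group(1)}.{tag.group(2)}"
    return release, cuda


if __name__ == "__main__":
    installed = parse_torch_version("2.2.0+cu121")  # version from the first post
    reference = parse_torch_version("2.6.0+cu124")  # assumed tag for 2.6.0 with CUDA 12.4
    # Tuples compare element by element, so this is a proper release comparison.
    print("installed:", installed, "older than reference:", installed[0] < reference[0])
```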

I installed PyTorch 2.6.0 with CUDA 12.4 as recommended, but I’m still getting the same error.

Error message:
[1]+ Segmentation fault (core dumped)

From the dmesg log:
[ 1220.023326] pt_autograd_0[13641]: segfault at 52a281 ip 00007d77d97f21b6 sp 00007d773bdfb538 error 6 in libcuda.so.550.120[7d77d96de000+4d5000] likely on CPU 9 (core 16, socket 0)
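
The dmesg line can be decoded a little further. On x86 Linux, the bracketed pair after the library name is the mapping's base address and size, so `ip - base` gives the faulting offset inside the library, and the `error` value is the page-fault error code (bit 0 set: protection violation rather than an unmapped page, bit 1 set: write access, bit 2 set: user mode). A minimal sketch that parses the line above; the helper name `decode_segfault_line` is illustrative.

```python
import re

# Layout of the x86 kernel message: comm[pid]: segfault at ADDR ip IP sp SP error CODE in LIB[BASE+SIZE]
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault_line(line):
    """Extract the thread, library, offset and access type from a segfault line."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not an x86 segfault line")
    err = int(m["err"], 16)
    return {
        "thread": m["comm"],
        "pid": int(m["pid"]),
        "fault_address": int(m["addr"], 16),
        "library": m["lib"],
        "offset_in_library": int(m["ip"], 16) - int(m["base"], 16),
        "protection_violation": bool(err & 1),  # False: the page was not mapped
        "write_access": bool(err & 2),
        "user_mode": bool(err & 4),
    }


if __name__ == "__main__":
    line = (
        "[ 1220.023326] pt_autograd_0[13641]: segfault at 52a281 "
        "ip 00007d77d97f21b6 sp 00007d773bdfb538 error 6 in "
        "libcuda.so.550.120[7d77d96de000+4d5000] likely on CPU 9 (core 16, socket 0)"
    )
    info = decode_segfault_line(line)
    print(info["thread"], info["library"], hex(info["offset_in_library"]))
```

For this line the result is a user-mode write to an unmapped address (0x52a281) at offset 0x1141b6 inside libcuda.so.550.120, raised from the thread named pt_autograd_0. In other words, the fault is inside the NVIDIA driver library and not in Python code.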

I reduced the training script to a minimal version that still reproduces the error, and I’m sharing it below for reference. Thank you.

import torch
import torch.nn as nn
import numpy as np
import math
import time
import traceback
from tqdm import tqdm

#device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")
device = torch.device("cuda:0")
print(f"[Info] Using device: {device}", flush=True)

# Dummy Config class
class DummyOpt:
    def __init__(self):
        self.hiddenSize = 100
        self.step = 1
        self.batchSize = 100
        self.nonhybrid = False
        self.lr = 0.001
        self.l2 = 1e-5
        self.lr_dc_step = 3
        self.lr_dc = 0.1

# GNN model definition (can be skipped if imported above)
class GNN(nn.Module):
    def __init__(self, hidden_size, step=1):
        super(GNN, self).__init__()
        self.step = step
        self.hidden_size = hidden_size
        self.linear_edge_in = nn.Linear(hidden_size, hidden_size)
        self.linear_edge_out = nn.Linear(hidden_size, hidden_size)
        self.w_ih = nn.Linear(hidden_size * 2, hidden_size * 3)
        self.w_hh = nn.Linear(hidden_size, hidden_size * 3)

    def GNNCell(self, A, hidden):
        input_in = torch.matmul(A[:, :, :A.shape[1]], self.linear_edge_in(hidden))
        input_out = torch.matmul(A[:, :, A.shape[1]:], self.linear_edge_out(hidden))
        inputs = torch.cat([input_in, input_out], dim=2)

        gi = self.w_ih(inputs)
        gh = self.w_hh(hidden)
        i_r, i_i, i_n = gi.chunk(3, dim=2)
        h_r, h_i, h_n = gh.chunk(3, dim=2)

        resetgate = torch.sigmoid(i_r + h_r)
        inputgate = torch.sigmoid(i_i + h_i)
        newgate = torch.tanh(i_n + resetgate * h_n)
        hy = newgate + inputgate * (hidden - newgate)
        return hy

    def forward(self, A, hidden):
        for _ in range(self.step):
            hidden = self.GNNCell(A, hidden)
        return hidden

class SessionGraph(nn.Module):
    def __init__(self, opt, n_node):
        super(SessionGraph, self).__init__()
        self.hidden_size = opt.hiddenSize
        self.batch_size = opt.batchSize
        self.nonhybrid = opt.nonhybrid
        self.embedding = nn.Embedding(n_node, self.hidden_size)
        self.gnn = GNN(self.hidden_size, step=opt.step)
        self.linear_one = nn.Linear(self.hidden_size, self.hidden_size)
        self.linear_two = nn.Linear(self.hidden_size, self.hidden_size)
        self.linear_three = nn.Linear(self.hidden_size, 1)
        self.linear_transform = nn.Linear(self.hidden_size * 2, self.hidden_size)

    def compute_scores(self, hidden, mask):
        ht = hidden[torch.arange(mask.shape[0]), torch.sum(mask, 1) - 1]
        q1 = self.linear_one(ht).unsqueeze(1)
        q2 = self.linear_two(hidden)
        alpha = self.linear_three(torch.sigmoid(q1 + q2))
        a = torch.sum(alpha * hidden * mask.unsqueeze(-1).float(), dim=1)
        if not self.nonhybrid:
            a = self.linear_transform(torch.cat([a, ht], dim=1))
        b = self.embedding.weight[1:]  # exclude padding idx
        scores = torch.matmul(a, b.transpose(1, 0))
        return scores

    def forward(self, inputs, A):
        hidden = self.embedding(inputs)
        hidden = self.gnn(A, hidden)
        return hidden

# Generate dummy input
def generate_dummy_data(batch_size, seq_len, n_node):
    alias_inputs = np.tile(np.arange(seq_len), (batch_size, 1))  # (batch, seq)
    A = np.random.rand(batch_size, seq_len, seq_len * 2)  # (batch, seq, 2 * seq), e.g. 100 x 10 x 20 = 20,000 values
    items = np.random.randint(1, n_node, size=(batch_size, seq_len))
    mask = (items != 0).astype(int)
    targets = np.random.randint(1, n_node, size=(batch_size,))
    return alias_inputs, A, items, mask, targets

# Main loop
def run_dummy_loop():
    opt = DummyOpt()
    n_node = 1000
    model = SessionGraph(opt, n_node).to(device)
    model.train()

    num_iterations = 5000
    seq_len = 10

    for i in range(num_iterations):
        try:
            ##########################################
            # Create variables on CPU
            alias_inputs, A, items, mask, targets = generate_dummy_data(opt.batchSize, seq_len, n_node)

            ##########################################
            # Move variables from CPU to GPU
            alias_inputs = torch.tensor(alias_inputs, dtype=torch.long, device=device)
            A = torch.tensor(A, dtype=torch.float32, device=device)
            items = torch.tensor(items, dtype=torch.long, device=device)
            mask = torch.tensor(mask, dtype=torch.long, device=device)
            targets = torch.tensor(targets, dtype=torch.long, device=device)

            # Check for NaNs or Infs (targets are integers, so these asserts always pass)
            assert not torch.isnan(targets).any(), "Targets contain NaNs"
            assert not torch.isinf(targets).any(), "Targets contain Infs"

            ##########################################
            # GPU-side computation (GNN message passing)
            hidden = model(items, A)

            # Reorder sequence using loop + indexing
            seq_hidden = torch.stack([hidden[i][alias_inputs[i]] for i in range(alias_inputs.shape[0])])

            # Reorder sequence using gather -> No error initially, but got illegal instruction at loop 97
            #alias_idx = alias_inputs.unsqueeze(-1).expand(-1, -1, hidden.size(2))  # (batch, seq_len, hidden_size)
            #seq_hidden = torch.gather(hidden, dim=1, index=alias_idx)

            ##########################################
            # GPU-side computation (attention readout and item scores)
            scores = model.compute_scores(seq_hidden, mask)

            torch.cuda.synchronize()

            ##########################################
            # GPU-side computation (prediction and loss update)
            loss_fn = nn.CrossEntropyLoss()
            loss = loss_fn(scores, targets - 1)

            assert not torch.isnan(loss).any(), "Loss contains NaNs"

            loss.backward()

            ##########################################
            # Move prediction values back to CPU

            # Test NumPy conversion
            result_np = loss.detach().cpu().numpy()  # changed from loss.item() but still causes error
            _ = np.log(np.clip(result_np, 1e-8, None))

            torch.cuda.synchronize()

            if i % 1000 == 0:
                print(f"[{i}] Loss: {result_np:.6f}", flush=True)

        except Exception as e:
            print(f"\nException at iteration {i}: {traceback.format_exc()}", flush=True)
            break

for i in tqdm(range(0, 100), desc='progress'):
    print(f'loop {i}th')
    run_dummy_loop()

@ptrblck hi sir, any feedback would be really helpful. Thank you

I cannot reproduce the issue using torch==2.6.0 and see:

python tmp.py 
[Info] Using device: cuda:0
progress:   0%|                                                                                                                                                                            | 0/100 [00:00<?, ?it/s]loop 0th
[0] Loss: 14.333693
[1000] Loss: 13.637185
[2000] Loss: 13.238361
[3000] Loss: 14.255023
[4000] Loss: 13.555018
progress:   1%|█▌                                                                                                                                                                | 1/100 [00:46<1:17:21, 46.88s/it]loop 1th
[0] Loss: 13.685474
[1000] Loss: 13.187521
...

before stopping the job.