Undefined symbol: _ZN3c1017RegisterOperatorsD1Ev

Hello,

I recently updated my PyTorch to 2.2.1+cu121 using pip and then installed the two packages torch-sparse and torch-scatter as follows:
pip install torch-sparse
pip install torch-scatter
Afterwards, the following error is reported:
/lib/python3.10/site-packages/torch_sparse/_version_cuda.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev

Someone mentioned this happens because multiple versions are installed, so I repeatedly ran pip uninstall torch-sparse and pip uninstall torch-scatter until it returned:
WARNING: Ignoring invalid distribution -orch (/lib/python3.10/site-packages)
WARNING: Skipping torch-sparse as it is not installed.

Then I ran pip install torch-sparse again, but the error persists (the same happens for torch_scatter):
/lib/python3.10/site-packages/torch_sparse/_version_cuda.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev

I am wondering how to solve this issue. Thanks!

The torch-sparse installation might be mismatched with the PyTorch installation. Make sure to install compatible versions based on the release notes from torch-sparse.
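For reference, the symbol demangles to c10::RegisterOperators::~RegisterOperators(), which usually means the extension binary was compiled against a different PyTorch build than the one you are running. A minimal way to check the installed build (a sketch):

import torch
# The version and CUDA suffix printed here (e.g. 2.2.1+cu121 / 12.1) must match
# the build the torch-sparse / torch-scatter wheels were compiled for.
print(torch.__version__, torch.version.cuda)

Then reinstall the extensions from the matching wheel index, e.g. (assuming data.pyg.org hosts wheels for your torch/CUDA combination; fill in the placeholders, such as 2.2.0 and cu121):

pip install --force-reinstall torch-scatter torch-sparse -f https://data.pyg.org/whl/torch-${TORCH}+${CUDA}.html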

Thanks for your reply. I have successfully solved the issue. However, I ran into another one that did not happen before updating PyTorch:

CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Then I set os.environ['CUDA_LAUNCH_BLOCKING'] = "1", and it returns:

CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I haven’t changed my code, only updated PyTorch, and I can also confirm that GPU memory is sufficient. Could you please give me some suggestions? Thanks.

You could rerun your code via compute-sanitizer python script.py args to check which kernel is causing the memory violation and post the beginning of the logs here, assuming an error is indeed detected by compute-sanitizer.

Hello,

This is what it returns:

COMPUTE-SANITIZER

========= Error: No attachable process found. compute-sanitizer timed-out.

========= Default timeout can be adjusted with --launch-timeout. Awaiting target completion.

I have tried adjusting --launch-timeout, but the same logs are returned.

I added:

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
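# Assumption: the flag is picked up when CUDA is initialized, so it should be set
# before the first CUDA call in the process for launches to become synchronous.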

Then I found that the error is raised inside torch-sparse:

RuntimeError                              Traceback (most recent call last)
Input In [18], in <cell line: 3>()
      7 model.reset_parameters()
      8 for epoch in range(args.epochs):
----> 9     loss = train(data)

Input In [16], in train(data)
     19 y = batch1.y[:batch1.batch_size][train].to(device)
     20
---> 21 out = model(x1, adj_t1, id1, batch1.batch_size, args.K_train, args.alpha)[:batch1.batch_size][train]
     22 loss = F.nll_loss(out, y)

File /lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
   1509     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1510 else:
-> 1511     return self._call_impl(*args, **kwargs)

File /lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
   1515 # If we don't have any hooks, we want to skip the rest of the logic in
   1516 # this function, and just call forward.
   1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1518         or _global_backward_pre_hooks or _global_backward_hooks
   1519         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520     return forward_call(*args, **kwargs)
   1522 try:
   1523     result = None

Input In [14], in Net.forward(self, x, adj, id, size, K, alpha)
     37 z = x.clone()
     38 for i in range(K-1):
---> 39     z = (1 - alpha) * (adj @ z) + alpha * x

File /lib/python3.10/site-packages/torch_sparse/matmul.py:171, in <lambda>(self, other)
    167 SparseTensor.spspmm = lambda self, other, reduce="sum": spspmm(
    168     self, other, reduce)
    169 SparseTensor.matmul = lambda self, other, reduce="sum": matmul(
    170     self, other, reduce)
--> 171 SparseTensor.__matmul__ = lambda self, other: matmul(self, other, 'sum')

File /lib/python3.10/site-packages/torch_sparse/matmul.py:160, in matmul(src, other, reduce)
    142 """Matrix product of a sparse tensor with either another sparse tensor or a
    143 dense tensor. The sparse tensor represents an adjacency matrix and is
    144 stored as a list of edges. This method multiplies elements along the rows
    (...)
    157 :rtype: (:class:`Tensor`)
    158 """
    159 if isinstance(other, torch.Tensor):
--> 160     return spmm(src, other, reduce)
    161 elif isinstance(other, SparseTensor):
    162     return spspmm(src, other, reduce)

File /lib/python3.10/site-packages/torch_sparse/matmul.py:83, in spmm(src, other, reduce)
     79 def spmm(src: SparseTensor,
     80          other: torch.Tensor,
     81          reduce: str = "sum") -> torch.Tensor:
     82     if reduce == 'sum' or reduce == 'add':
---> 83         return spmm_sum(src, other)
     84     elif reduce == 'mean':
     85         return spmm_mean(src, other)

File /lib/python3.10/site-packages/torch_sparse/matmul.py:24, in spmm_sum(src, other)
     22 if other.requires_grad:
     23     row = src.storage.row()
---> 24     csr2csc = src.storage.csr2csc()
     25     colptr = src.storage.colptr()
     27 return torch.ops.torch_sparse.spmm_sum(row, rowptr, col, value, colptr,
     28                                        csr2csc, other)

File /lib/python3.10/site-packages/torch_sparse/storage.py:412, in SparseStorage.csr2csc(self)
    409 if csr2csc is not None:
    410     return csr2csc
--> 412 idx = self._sparse_sizes[0] * self._col + self.row()
    413 max_value = self._sparse_sizes[0] * self._sparse_sizes[1]
    414 _, csr2csc = index_sort(idx, max_value)

I am not sure whether this is related to torch-sparse or to PyTorch directly; there is a similar discussion here:
https://github.com/rusty1s/pytorch_sparse/issues/314
I also fail at a simple script:

import torch

A = torch.randn(5, 5).to_sparse().cuda()
torch.sparse.mm(A, A)

The same error is reported:

CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I cannot reproduce the issue using the simple script and get:

tensor(indices=tensor([[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3,
                        3, 4, 4, 4, 4, 4],
                       [0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 0, 1, 2, 3,
                        4, 0, 1, 2, 3, 4]]),
       values=tensor([ 2.2501, -0.0442,  1.2576,  0.3120, -1.9511,  1.8506,
                       2.0691, -0.5324, -0.2164, -0.3505,  4.5437,  2.7143,
                       4.5200, -0.3993, -7.1216,  0.0259, -0.9292,  0.6680,
                       0.8927, -2.2475, -1.5833,  1.2326, -0.1331,  1.5515,
                       2.1803]),
       device='cuda:0', size=(5, 5), nnz=25, layout=torch.sparse_coo)

If you are still seeing an illegal memory access with it, does compute-sanitizer work on this script?

Hello ptrblck,

Sorry for the confusion. I can also run that simple script successfully and get the same results. I wrote another very simple script, shown below, which does reproduce the issue:

import argparse
import os.path as osp
from typing import Tuple
import numpy as np
import time
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import Tensor
from torch.nn import Linear
import scipy.sparse as sp
import torch_geometric
import torch_geometric.transforms as T
from torch_geometric.datasets import Planetoid
from torch_geometric.logging import init_wandb, log
from torch_geometric.utils import to_undirected
from torch_geometric.loader import DataLoader
from torch_geometric.loader import NeighborLoader
from ogb.nodeproppred import PygNodePropPredDataset, Evaluator
from torch_geometric.nn import GCNConv
from torch_geometric.nn.conv.gcn_conv import gcn_norm

def index2mask(idx: Tensor, size: int) -> Tensor:
    mask = torch.zeros(size, dtype=torch.bool, device=idx.device)
    mask[idx] = True
    return mask

def gen_masks(y: Tensor, train_per_class: int = 20, val_per_class: int = 30,
              num_splits: int = 20) -> Tuple[Tensor, Tensor, Tensor]:
    num_classes = int(y.max()) + 1

    train_mask = torch.zeros(y.size(0), num_splits, dtype=torch.bool)
    val_mask = torch.zeros(y.size(0), num_splits, dtype=torch.bool)

    for c in range(num_classes):
        idx = (y == c).nonzero(as_tuple=False).view(-1)
        perm = torch.stack(
            [torch.randperm(idx.size(0)) for _ in range(num_splits)], dim=1)
        idx = idx[perm]

        train_idx = idx[:train_per_class]
        train_mask.scatter_(0, train_idx, True)
        val_idx = idx[train_per_class:train_per_class + val_per_class]
        val_mask.scatter_(0, val_idx, True)

    test_mask = ~(train_mask | val_mask)

    return train_mask, val_mask, test_mask

def get_arxiv():
    root='/tmp/datasets'
    dataset = PygNodePropPredDataset('ogbn-arxiv', f'{root}/OGB',
                                     pre_transform=T.ToSparseTensor())
    data = dataset[0]
    data.adj_t = data.adj_t.to_symmetric()
    data.node_year = None
    data.y = data.y.view(-1)
    split_idx = dataset.get_idx_split()
    data.train_mask = index2mask(split_idx['train'], data.num_nodes)
    data.val_mask = index2mask(split_idx['valid'], data.num_nodes)
    data.test_mask = index2mask(split_idx['test'], data.num_nodes)
    return data, dataset.num_features, dataset.num_classes

data, in_channels, out_channels = get_arxiv()
dataset = PygNodePropPredDataset(name='ogbn-arxiv')
device = torch.device("cuda:2" if torch.cuda.is_available() else "cpu")

data.adj_t = data.adj_t.set_diag()
data.adj_t = gcn_norm(data.adj_t, add_self_loops=False)
data.n_id = torch.arange(data.num_nodes)

parser = argparse.ArgumentParser()
parser.add_argument('--runs', type=int, default=1)
parser.add_argument('--epochs', type=int, default=2000)
parser.add_argument('--lr', type=float, default=0.01)
parser.add_argument('--weight_decay', type=float, default=0)
parser.add_argument('--early_stopping', type=int, default=0)
parser.add_argument('--hidden', type=int, default=256)
parser.add_argument('--num_layers', type=int, default=3)
parser.add_argument('--dropout', type=float, default=0.5)
parser.add_argument('--normalize_features', action='store_true')
args = parser.parse_args(args=[])

class Net(nn.Module):
    def __init__(self, num_features, hidden_channels, num_classes, num_layers, num_nodes, **kwargs):
        super(Net, self).__init__()
        
        self.convs = torch.nn.ModuleList()
        self.convs.append(GCNConv(num_features, hidden_channels, normalize=False))
        for _ in range(num_layers - 2):
            self.convs.append(GCNConv(hidden_channels, hidden_channels, normalize=False))
        self.convs.append(GCNConv(hidden_channels, num_classes, normalize=False))
        self.num_classes = num_classes
        self.num_nodes = num_nodes
        self.hidden_channels = hidden_channels

    def reset_parameters(self):
        for conv in self.convs:
            conv.reset_parameters()

    def forward(self):
        data_z = self.convs[0](data.x.to(device), data.adj_t.to(device))
        return data_z

model = Net(data.x.shape[1], args.hidden, dataset.num_classes, args.num_layers, data.num_nodes)
model = model.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=args.lr, weight_decay=args.weight_decay)

def train(data):
    model.train()

    neigh_out = model()

    return None

acc = []
best = 0
for j in range(args.runs):
    tr = []
    val_accs = []
    test_accs = []
    for epoch in range(args.epochs):
        loss = train(data)

It returns the same error for me:

CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I am wondering if you could please take a look at it. Thanks in advance.

Again, run compute-sanitizer on it, since based on the current discussion it seems you might be randomly running into issues that PyTorch is just re-raising (e.g. caused by a non-functioning driver installation).
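If compute-sanitizer keeps timing out before it can attach, you could also try raising the timeout and forcing synchronous launches, e.g. (a sketch, assuming a single-process run; adjust the timeout value to your setup):

CUDA_LAUNCH_BLOCKING=1 compute-sanitizer --launch-timeout 120 python script.py args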

Thanks. I tried it, and it returns the same output:

COMPUTE-SANITIZER

========= Error: No attachable process found. compute-sanitizer timed-out.

========= Default timeout can be adjusted with --launch-timeout. Awaiting target completion.

However, I found that the bug does not happen if I use cuda:0, but it does on all other GPUs. It also appears only after several epochs rather than at the beginning of training. I am wondering what I should do to solve it?
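For now, a workaround I am considering is to expose only the GPU I need so that it shows up as cuda:0 inside the process; a minimal sketch (this hides the other devices rather than fixing the underlying issue, and the variable has to be set before CUDA is initialized):

import os
# Expose only physical GPU 2; inside this process it is then visible as cuda:0.
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

import torch
device = torch.device("cuda:0")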

Hi, I ran into the same issue with CUDA 12.4 and torch-scatter 2.1.2. How did you fix it? Thanks!

/lib/python3.12/site-packages/torch_scatter/_version_cpu.so: undefined symbol: _ZN3c1017RegisterOperatorsD1Ev

Hi, any luck here? I have the same problem. It says that torch-scatter should be the same version as torch, but the latest version of torch is 2.2.2 while the latest version of torch-scatter is 2.1.2. When I downgrade torch to 2.1.2, the kernel dies already at the imports…

Any help would be much appreciated!