Hello,
This is the output I get:
COMPUTE-SANITIZER
========= Error: No attachable process found. compute-sanitizer timed-out.
========= Default timeout can be adjusted with --launch-timeout. Awaiting target completion.
I have tried adjusting --launch-timeout, but the same logs are returned.
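For reference, the flag goes on the sanitizer command line, roughly like this (the script name and the timeout value here are just placeholders):
compute-sanitizer --launch-timeout 120 python train.py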
I then added the following, to force synchronous kernel launches so the error points at the failing op:
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
With that set, I found that the error occurs inside torch-sparse:
RuntimeError Traceback (most recent call last)
Input In [18], in <cell line: 3>()
[7] model.reset_parameters()
[8] for epoch in range(args.epochs):
----> [9] loss = train(data)
Input In [16], in train(data)
[19] y = batch1.y[:batch1.batch_size][train].to(device)
[20]
---> [21] out = model(x1, adj_t1, id1, batch1.batch_size, args.K_train, args.alpha)[:batch1.batch_size][train]
[22] loss = F.nll_loss(out, y)
File /lib/python3.10/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
[1509] return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
[1510] else:
-> [1511] return self._call_impl(*args, **kwargs)
File /lib/python3.10/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
[1515] # If we don't have any hooks, we want to skip the rest of the logic in
[1516] # this function, and just call forward.
[1517] if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
[1518] or _global_backward_pre_hooks or _global_backward_hooks
[1519] or _global_forward_hooks or _global_forward_pre_hooks):
-> [1520] return forward_call(*args, **kwargs)
[1522] try:
[1523] result = None
Input In [14], in Net.forward(self, x, adj, id, size, K, alpha)
[37] z = x.clone()
[38] for i in range(K-1):
---> [39] z = (1 - alpha) * (adj @ z) + alpha * x
File /lib/python3.10/site-packages/torch_sparse/matmul.py:171, in <lambda>(self, other)
[167] SparseTensor.spspmm = lambda self, other, reduce="sum": spspmm(
[168] self, other, reduce)
[169] SparseTensor.matmul = lambda self, other, reduce="sum": matmul(
[170] self, other, reduce)
--> [171] SparseTensor.__matmul__ = lambda self, other: matmul(self, other, 'sum')
File /lib/python3.10/site-packages/torch_sparse/matmul.py:160, in matmul(src, other, reduce)
[142] """Matrix product of a sparse tensor with either another sparse tensor or a
[143] dense tensor. The sparse tensor represents an adjacency matrix and is
[144] stored as a list of edges. This method multiplies elements along the rows
(...)
[157] :rtype: (:class:`Tensor`)
[158] """
[159] if isinstance(other, torch.Tensor):
--> [160] return spmm(src, other, reduce)
[161] elif isinstance(other, SparseTensor):
[162] return spspmm(src, other, reduce)
File /lib/python3.10/site-packages/torch_sparse/matmul.py:83, in spmm(src, other, reduce)
[79] def spmm(src: SparseTensor,
[80] other: torch.Tensor,
[81] reduce: str = "sum") -> torch.Tensor:
[82] if reduce == 'sum' or reduce == 'add':
---> [83] return spmm_sum(src, other)
[84] elif reduce == 'mean':
[85] return spmm_mean(src, other)
File /lib/python3.10/site-packages/torch_sparse/matmul.py:24, in spmm_sum(src, other)
[22] if other.requires_grad:
[23] row = src.storage.row()
---> [24] csr2csc = src.storage.csr2csc()
[25] colptr = src.storage.colptr()
[27] return torch.ops.torch_sparse.spmm_sum(row, rowptr, col, value, colptr,
[28] csr2csc, other)
File /lib/python3.10/site-packages/torch_sparse/storage.py:412, in SparseStorage.csr2csc(self)
[409] if csr2csc is not None:
[410] return csr2csc
--> [412] idx = self._sparse_sizes[0] * self._col + self.row()
[413] max_value = self._sparse_sizes[0] * self._sparse_sizes[1]
[414] _, csr2csc = index_sort(idx, max_value)
I am not sure whether this is related to torch-sparse or to PyTorch directly; there is a similar discussion here:
https://github.com/rusty1s/pytorch_sparse/issues/314
Even a simple script fails:
import torch
A = torch.randn(5, 5).to_sparse().cuda()
torch.sparse.mm(A, A)
and the same error is reported:
CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
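For completeness, here is a minimal sketch that should exercise the same torch_sparse frames as the traceback above (matmul -> spmm_sum -> csr2csc); the shapes and names are arbitrary placeholders, not from my actual model:
import torch
from torch_sparse import SparseTensor

# small random adjacency stored as a torch_sparse SparseTensor on the GPU
adj = SparseTensor.from_dense(torch.randn(5, 5)).cuda()

# dense features with requires_grad=True, so spmm_sum takes the
# storage.csr2csc() branch shown in the traceback above
x = torch.randn(5, 3, device='cuda', requires_grad=True)

# SparseTensor.__matmul__ -> matmul -> spmm -> spmm_sum
out = adj @ x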