CUDA error when converting a tensor to CPU

Haron_Wan · April 17, 2021, 2:38pm

I got this error while running GAT, can anyone help me with it? Thanks a lot.

File “/home/disk1/users/MyGat/MyGat0417/utils.py”, line 114, in _parse_and_check_input
y_true = y_true.detach().cpu().numpy()

RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of ‘dmlc::Error’
what(): [22:21:45] /opt/dgl/src/runtime/cuda/cuda_device_api.cc:103: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: device-side assert triggered
Stack trace:
[bt] (0) /home/disk1/users/.local/lib/python3.6/site-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4f) [0x7f13167772cf]
[bt] (1) /home/disk1/guoweifeng/.local/lib/python3.6/site-packages/dgl/libdgl.so(dgl::runtime::CUDADeviceAPI::FreeDataSpace(DLContext, void*)+0x15c) [0x7f1316fc44ec]
[bt] (2) /home/disk1/users/.local/lib/python3.6/site-packages/dgl/libdgl.so(dgl::runtime::NDArray::Internal::DefaultDeleter(dgl::runtime::NDArray::Container*)+0x1ad) [0x7f1316e824fd]
[bt] (3) /home/disk1/users/.local/lib/python3.6/site-packages/dgl/libdgl.so(dgl::UnitGraph::COO::~COO()+0x127) [0x7f1316f9ea37]
[bt] (4) /home/disk1/users/.local/lib/python3.6/site-packages/dgl/libdgl.so(dgl::UnitGraph::~UnitGraph()+0x1ba) [0x7f1316f9e7ca]
[bt] (5) /home/disk1/users/.local/lib/python3.6/site-packages/dgl/libdgl.so(dgl::HeteroGraph::~HeteroGraph()+0x119) [0x7f1316e9b4e9]
[bt] (6) /home/disk1/users/.local/lib/python3.6/site-packages/dgl/libdgl.so(DGLObjectFree+0xb5) [0x7f1316e592c5]
[bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f147cfa2dae]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x22f) [0x7f147cfa271f]

ptrblck · April 18, 2021, 12:37am

Could you rerun your code via CUDA_LAUNCH_BLOCKING=1 python script.py args and check the stacktrace for an error message or post it here, please?
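If changing the launch command is inconvenient, the same variable can probably be set from inside the script instead. This is a minimal sketch, assuming it runs at the very top of the entry script before CUDA is initialized (safest is before `torch` is imported); it only makes kernel launches synchronous so the reported stack trace points at the failing operation, and does not change the result.

```python
import os

# Assumption: this runs before any CUDA work happens in the process,
# ideally before `import torch`, so the setting is picked up.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import afterwards, then run the training code as usual
```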

Haron_Wan · April 18, 2021, 7:13am

I tried running with CUDA_LAUNCH_BLOCKING=1, but it did not fix the problem. Here is the whole stack trace. Thanks a lot for your help. Let me know if you need more details.

users@amax:~$ python3 -u /home/disk1/users/MyGat/MyGat0417/gat.py --gpu=0
Using backend: pytorch
/home/disk1/users/MyGat/MyGat0417/gat.py --gpu=0
Loading dataset
done
Total edges before adding self-loop 16148
Total edges after adding self-loop 18301
running 1: 0%| | 0/200 [00:00<?, ?it/s]/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [4,0,0], thread: [52,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
running 1: 0%| | 0/200 [00:00<?, ?it/s]
Traceback (most recent call last):
File “/home/disk1/users/MyGat/MyGat0417/gat.py”, line 491, in <module>
main()
File “/home/disk1/users/MyGat/MyGat0417/gat.py”, line 457, in main
log_file, tensorboard_writer, pred_mx)
File “/home/disk1/users/MyGat/MyGat0417/gat.py”, line 275, in run
acc = compute_acc(pred[train_idx], labels[train_idx], evaluator, pred_mx=pred_mx, pred_idx=pred_idx, train_idx=train_idx, importance=True)
File “/home/disk1/users/MyGat/MyGat0417/gat.py”, line 143, in compute_acc
result = evaluator.eval({“y_pred”: y_pred, “y_true”: y_true})
File “/home/disk1/users/MyGat/MyGat0417/utils.py”, line 144, in eval
y_true, y_pred = self._parse_and_check_input(input_dict)
File “/home/disk1/users/MyGat/MyGat0417/utils.py”, line 114, in _parse_and_check_input
y_true = y_true.detach().cpu().numpy()
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of ‘dmlc::Error’
what(): [15:06:27] /opt/dgl/src/runtime/cuda/cuda_device_api.cc:103: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: device-side assert triggered
Stack trace:
[bt] (0) /home/disk1/users/.local/lib/python3.6/site-packages/dgl/libdgl.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x4f) [0x7f021641d2cf]
[bt] (1) /home/disk1/users/.local/lib/python3.6/site-packages/dgl/libdgl.so(dgl::runtime::CUDADeviceAPI::FreeDataSpace(DLContext, void*)+0x15c) [0x7f0216c6a4ec]
[bt] (2) /home/disk1/users/.local/lib/python3.6/site-packages/dgl/libdgl.so(dgl::runtime::NDArray::Internal::DefaultDeleter(dgl::runtime::NDArray::Container*)+0x1ad) [0x7f0216b284fd]
[bt] (3) /home/disk1/users/.local/lib/python3.6/site-packages/dgl/libdgl.so(dgl::UnitGraph::COO::~COO()+0x127) [0x7f0216c44a37]
[bt] (4) /home/disk1/users/.local/lib/python3.6/site-packages/dgl/libdgl.so(dgl::UnitGraph::~UnitGraph()+0x1ba) [0x7f0216c447ca]
[bt] (5) /home/disk1/users/.local/lib/python3.6/site-packages/dgl/libdgl.so(dgl::HeteroGraph::~HeteroGraph()+0x119) [0x7f0216b414e9]
[bt] (6) /home/disk1/users/.local/lib/python3.6/site-packages/dgl/libdgl.so(DGLObjectFree+0xb5) [0x7f0216aff2c5]
[bt] (7) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7f037cbc3dae]
[bt] (8) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x22f) [0x7f037cbc371f]

Aborted (core dumped)

ptrblck · April 19, 2021, 5:37am

Thanks for the update!
Running the script via CUDA_LAUNCH_BLOCKING=1 won’t “fix” the issue, but should give you a proper error message with the failing operation, which is also the case for you:

/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [4,0,0], thread: [52,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.

Based on this, an indexing operation is failing, so you would need to check where indexing is used in your code and which index values are out of bounds for the tensor being indexed.
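One way to run that check is to validate the index values on the host before they are used. The sketch below mirrors the condition from the assertion message (`index >= -size && index < size`); the helper name `find_out_of_bounds` is hypothetical, and it works on plain Python integers, so an index tensor would first be converted with something like `train_idx.tolist()`.

```python
def find_out_of_bounds(indices, size):
    """Return (position, value) pairs for indices outside [-size, size).

    `indices` is any iterable of Python ints; `size` is the length of the
    dimension being indexed. Hypothetical helper, not part of PyTorch or DGL.
    """
    return [
        (pos, idx)
        for pos, idx in enumerate(indices)
        if not (-size <= idx < size)
    ]


# Small self-contained demonstration with made-up values:
bad = find_out_of_bounds([0, 3, -4, 4, -5], size=4)
print(bad)
```

In the traceback above, the candidates would be the tensors indexed in `compute_acc(pred[train_idx], labels[train_idx], ...)`, for example `find_out_of_bounds(train_idx.tolist(), labels.shape[0])`, assuming `train_idx` is an integer index tensor. Any other indexing done with `pred_idx` inside `compute_acc` would be worth checking the same way.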

Harvell36 · April 23, 2021, 6:46am

Thank you for updating us with the outcome!