Hi,
I am facing a strange CUDA error in my setup, and I don’t really know what’s happening. What I have in my Python code is:
- Some PyTorch code,
- A C++ library using CUDA, with Python wrappers.
When I use that C++ library from Python on its own, it works without any issue. However, if I mix it with PyTorch, the C++ library raises cudaErrorIllegalAddress: an illegal memory access was encountered.
import torch
import mymodule
# If I use "cpu" for the device: no error
a = torch.randn(1, 1, dtype=torch.float, device="cuda")
# Without this multiplication: no error
b = a @ a
for i in range(100000):
    # After a few iterations, a CUDA error is raised here if PyTorch
    # is used above
    obj = mymodule.MyClass()
I am using the precompiled PyTorch packages with CUDA support (e.g. 1.10.2+cu113), but my system has its own CUDA version (11.4), which the C++ library uses. Could this explain this kind of error? This comment seems to suggest that it can, but this is not confirmed.
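To make the mismatch explicit, here is a minimal sketch that compares the CUDA version encoded in the PyTorch wheel tag with the system toolkit version. The helper names are hypothetical, and it assumes the `+cuXYZ` tag convention where the last digit is the minor version. It only surfaces the mismatch; it does not show that the mismatch causes the error.

```python
import re
from typing import Optional, Tuple


def cuda_from_torch_version(version: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from a local tag such as '+cu113'.

    Assumption: the last digit of the tag is the minor version.
    Returns None when the version string carries no '+cu' tag.
    """
    match = re.search(r"\+cu(\d+)(\d)$", version)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def parse_toolkit_version(version: str) -> Tuple[int, int]:
    """Parse a toolkit version such as '11.4' into (major, minor)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def versions_differ(torch_version: str, toolkit_version: str) -> bool:
    """True when the wheel's CUDA version and the toolkit version differ."""
    wheel_cuda = cuda_from_torch_version(torch_version)
    if wheel_cuda is None:
        return False  # CPU-only or untagged build: nothing to compare
    return wheel_cuda != parse_toolkit_version(toolkit_version)


# Values taken from the post: wheel 1.10.2+cu113, system toolkit 11.4.
print(cuda_from_torch_version("1.10.2+cu113"))
print(versions_differ("1.10.2+cu113", "11.4"))
```

In a live session, the first argument would be `torch.__version__`; `torch.version.cuda` reports the same information directly.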
Some other pieces of information:
- I tested with CUDA_LAUNCH_BLOCKING=1 PYTORCH_NO_CUDA_MEMORY_CACHING=1, but the problem still happens.
- cudart is linked statically to the C++ library (I’ll try with dynamic linking).
- cuda-memcheck does not provide more information for the error in the C++ module, but I get this error when starting to use PyTorch:
========= Internal Memcheck Error: Initialization failed
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/lib/x86_64-linux-gnu/libcuda.so.1 [0x24bd9b]
========= Host Frame:/home/user/.pyenv/versions/env-3.9.1/lib/python3.9/site-packages/torch/lib/libcudart-a7b20f20.so.11.0 [0x3127c]
========= Host Frame:/home/user/.pyenv/versions/env-3.9.1/lib/python3.9/site-packages/torch/lib/libcudart-a7b20f20.so.11.0 [0x1ff4e]
========= Host Frame:/home/user/.pyenv/versions/env-3.9.1/lib/python3.9/site-packages/torch/lib/libcudart-a7b20f20.so.11.0 [0x37974]
========= Host Frame:/home/user/.pyenv/versions/env-3.9.1/lib/python3.9/site-packages/torch/lib/libcudart-a7b20f20.so.11.0 [0x395aa]
========= Host Frame:/home/user/.pyenv/versions/env-3.9.1/lib/python3.9/site-packages/torch/lib/libcudart-a7b20f20.so.11.0 [0x2f32e]
========= Host Frame:/home/user/.pyenv/versions/env-3.9.1/lib/python3.9/site-packages/torch/lib/libcudart-a7b20f20.so.11.0 [0x12268]
========= Host Frame:/home/user/.pyenv/versions/env-3.9.1/lib/python3.9/site-packages/torch/lib/libcudart-a7b20f20.so.11.0 (cudaMalloc + 0x10c) [0x4a55c]
========= Host Frame:/home/user/.pyenv/versions/env-3.9.1/lib/python3.9/site-packages/torch/lib/libc10_cuda.so [0x26874]
========= Host Frame:/home/user/.pyenv/versions/env-3.9.1/lib/python3.9/site-packages/torch/lib/libtorch_cuda_cpp.so (_ZN2at6native10empty_cudaEN3c108ArrayRefIlEENS1_8optionalINS1_10ScalarTypeEEENS4_INS1_6LayoutEEENS4_INS1_6DeviceEEENS4_IbEENS4_INS1_12MemoryFormatEEE + 0x124) [0x2d605a4]
========= Host Frame:/home/user/.pyenv/versions/env-3.9.1/lib/python3.9/site-packages/torch/lib/libtorch_cuda_cu.so [0x25ab39e]
========= Host Frame:/home/user/.pyenv/versions/env-3.9.1/lib/python3.9/site-packages/torch/lib/libtorch_cuda_cu.so [0x25ab41a]
========= Host Frame:/home/user/.pyenv/versions/env-3.9.1/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so [0x1d1503e]
========= Host Frame:/home/user/.pyenv/versions/env-3.9.1/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so (_ZN2at4_ops19empty_memory_format4callEN3c108ArrayRefIlEENS2_8optionalINS2_10ScalarTypeEEENS5_INS2_6LayoutEEENS5_INS2_6DeviceEEENS5_IbEENS5_INS2_12MemoryFormatEEE + 0x1c0) [0x1a1c040]
========= Host Frame:/home/user/.pyenv/versions/env-3.9.1/lib/python3.9/site-packages/torch/lib/libtorch_python.so (_ZN2at5emptyEN3c108ArrayRefIlEENS0_13TensorOptionsENS0_8optionalINS0_12MemoryFormatEEE + 0xf1) [0xbc47a1]
========= Host Frame:/home/user/.pyenv/versions/env-3.9.1/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so (_ZN2at6native5randnEN3c108ArrayRefIlEENS1_8optionalINS_9GeneratorEEENS4_INS1_10ScalarTypeEEENS4_INS1_6LayoutEEENS4_INS1_6DeviceEEENS4_IbEE + 0xe9) [0x1656ce9]
========= Host Frame:/home/user/.pyenv/versions/env-3.9.1/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so (_ZN2at6native5randnEN3c108ArrayRefIlEENS1_8optionalINS1_10ScalarTypeEEENS4_INS1_6LayoutEEENS4_INS1_6DeviceEEENS4_IbEE + 0x42) [0x1656e52]
========= Host Frame:/home/user/.pyenv/versions/env-3.9.1/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so [0x1eb5ed4]
========= Host Frame:/home/user/.pyenv/versions/env-3.9.1/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so [0x1d28954]
========= Host Frame:/home/user/.pyenv/versions/env-3.9.1/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so [0x1d18657]
========= Host Frame:/home/user/.pyenv/versions/env-3.9.1/lib/python3.9/site-packages/torch/lib/libtorch_cpu.so (_ZN2at4_ops5randn4callEN3c108ArrayRefIlEENS2_8optionalINS2_10ScalarTypeEEENS5_INS2_6LayoutEEENS5_INS2_6DeviceEEENS5_IbEE + 0x19e) [0x193bcde]
========= Host Frame:/home/user/.pyenv/versions/env-3.9.1/lib/python3.9/site-packages/torch/lib/libtorch_python.so [0x8b1816]
========= Host Frame:/home/user/.pyenv/versions/env-3.9.1/bin/python [0x2259b3]
========= Host Frame:/home/user/.pyenv/versions/env-3.9.1/bin/python (_PyObject_MakeTpCall + 0x94) [0x73e94]
========= Host Frame:/home/user/.pyenv/versions/env-3.9.1/bin/python (_PyEval_EvalFrameDefault + 0x6129) [0x62e79]
========= Host Frame:/home/user/.pyenv/versions/env-3.9.1/bin/python [0x5bdab]
========= Host Frame:/home/user/.pyenv/versions/env-3.9.1/bin/python (_PyEval_EvalFrameDefault + 0x60e7) [0x62e37]
========= Host Frame:/home/user/.pyenv/versions/env-3.9.1/bin/python [0x125f0a]
========= Host Frame:/home/user/.pyenv/versions/env-3.9.1/bin/python (PyEval_EvalCode + 0x3a) [0x12623a]
========= Host Frame:/home/user/.pyenv/versions/env-3.9.1/bin/python [0x166f37]
========= Host Frame:/home/user/.pyenv/versions/env-3.9.1/bin/python (PyRun_FileExFlags + 0xb3) [0x168e83]
========= Host Frame:/home/user/.pyenv/versions/env-3.9.1/bin/python (PyRun_SimpleFileExFlags + 0xff) [0x16901f]
========= Host Frame:/home/user/.pyenv/versions/env-3.9.1/bin/python [0x670df]
========= Host Frame:/home/user/.pyenv/versions/env-3.9.1/bin/python (Py_BytesMain + 0x6f) [0x676ff]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 [0x2dfd0]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0x7d) [0x2e07d]
========= Host Frame:/home/user/.pyenv/versions/env-3.9.1/bin/python (_start + 0x2e) [0x6630e]
=========
- I also got this backtrace when PyTorch cleans up its memory:
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
Exception raised from uncached_delete at ../c10/cuda/CUDACachingAllocator.cpp:1460 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f466d235d62 in /home/user/.pyenv/versions/env-3.9.1/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1bdbe (0x7f466d497dbe in /home/user/.pyenv/versions/env-3.9.1/lib/python3.9/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0xa4 (0x7f466d21f314 in /home/user/.pyenv/versions/env-3.9.1/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x29ee09 (0x7f46222c6e09 in /home/user/.pyenv/versions/env-3.9.1/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0xadfdf1 (0x7f4622b07df1 in /home/user/.pyenv/versions/env-3.9.1/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #5: THPVariable_subclass_dealloc(_object*) + 0x292 (0x7f4622b080f2 in /home/user/.pyenv/versions/env-3.9.1/lib/python3.9/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x5bfe5 (0x5640a5bc8fe5 in /home/user/.pyenv/versions/env-3.9.1/bin/python)
frame #7: <unknown function> + 0x1772a5 (0x5640a5ce42a5 in /home/user/.pyenv/versions/env-3.9.1/bin/python)
frame #8: <unknown function> + 0x1772bd (0x5640a5ce42bd in /home/user/.pyenv/versions/env-3.9.1/bin/python)
frame #9: <unknown function> + 0xa0f55 (0x5640a5c0df55 in /home/user/.pyenv/versions/env-3.9.1/bin/python)
frame #10: PyDict_SetItemString + 0x96 (0x5640a5c12e66 in /home/user/.pyenv/versions/env-3.9.1/bin/python)
frame #11: <unknown function> + 0x14827f (0x5640a5cb527f in /home/user/.pyenv/versions/env-3.9.1/bin/python)
frame #12: <unknown function> + 0x160c65 (0x5640a5ccdc65 in /home/user/.pyenv/versions/env-3.9.1/bin/python)
frame #13: Py_BytesMain + 0x74 (0x5640a5bd4704 in /home/user/.pyenv/versions/env-3.9.1/bin/python)
frame #14: <unknown function> + 0x2dfd0 (0x7f468b22cfd0 in /lib/x86_64-linux-gnu/libc.so.6)
frame #15: __libc_start_main + 0x7d (0x7f468b22d07d in /lib/x86_64-linux-gnu/libc.so.6)
frame #16: _start + 0x2e (0x5640a5bd330e in /home/user/.pyenv/versions/env-3.9.1/bin/python)
========= Error: process didn't terminate successfully
========= Fatal UVM GPU fault of type invalid pde due to invalid address
========= during read access to address 0x7f47794fc000
=========
========= Fatal UVM GPU fault of type invalid pde due to invalid address
========= during read access to address 0x7f478b9f7000
=========
========= Fatal UVM GPU fault of type invalid pde due to invalid address
========= during read access to address 0x7f4d90735000
=========
========= Fatal UVM GPU fault of type invalid pde due to invalid address
========= during read access to address 0x7f4d92f85000
=========
========= Fatal UVM GPU fault of type invalid pde due to invalid address
========= during read access to address 0x7f4d97c83000
=========
========= Fatal UVM GPU fault of type invalid pde due to invalid address
========= during read access to address 0x7f4d99adf000
=========
========= Fatal UVM GPU fault of type invalid pde due to invalid address
========= during read access to address 0x7f478b9f7000
=========
========= Fatal UVM GPU fault of type invalid pde due to invalid address
========= during read access to address 0x7f4d90735000
=========
========= Fatal UVM GPU fault of type invalid pde due to invalid address
========= during read access to address 0x7f4d92f85000
=========
========= Fatal UVM GPU fault of type invalid pde due to invalid address
========= during read access to address 0x7f4d97c83000
=========
========= Fatal UVM GPU fault of type invalid pde due to invalid address
========= during read access to address 0x7f4d99adf000
=========
========= No CUDA-MEMCHECK results found
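Since the backtraces above show PyTorch's bundled libcudart, one thing I can check is which CUDA shared objects are actually mapped into the process. Below is a minimal sketch (Linux only; the function name and the sample text are hypothetical) that filters `/proc/<pid>/maps` content for libcuda and libcudart. Note that a statically linked cudart will not appear as a separate entry.

```python
import re
from typing import List

# Matches paths whose file name starts with libcuda or libcudart.
CUDA_LIB = re.compile(r"/lib(cudart|cuda)[^/]*\.so[^/]*$")


def cuda_libraries(maps_text: str) -> List[str]:
    """Return the unique CUDA driver/runtime libraries named in maps text."""
    found: List[str] = []
    for line in maps_text.splitlines():
        # Fields: address, perms, offset, dev, inode, pathname (optional).
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping without a path
        path = fields[5].strip()
        if CUDA_LIB.search(path) and path not in found:
            found.append(path)
    return found


# Hypothetical sample in the /proc/<pid>/maps format, for illustration only.
sample = (
    "7f0000000000-7f0000001000 r-xp 00000000 08:01 100 /lib/x86_64-linux-gnu/libcuda.so.1\n"
    "7f0000001000-7f0000002000 r--p 00001000 08:01 100 /lib/x86_64-linux-gnu/libcuda.so.1\n"
    "7f0000002000-7f0000003000 r-xp 00000000 08:01 200 /opt/torch/lib/libcudart-a7b20f20.so.11.0\n"
    "7f0000003000-7f0000004000 r-xp 00000000 08:01 300 /opt/torch/lib/libc10_cuda.so\n"
    "7f0000004000-7f0000005000 rw-p 00000000 00:00 0\n"
)
for path in cuda_libraries(sample):
    print(path)
```

In the real script, the input would be `open("/proc/self/maps").read()`, taken after both `torch` and `mymodule` have been imported and used.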
I will try to put together a full repro soon, but in the meantime, if this is a known problem or limitation, feel free to let me know.