RuntimeError: CUDA error: invalid argument

Hi, I’m currently testing various SLAM systems for autonomous driving on embedded hardware (an aarch64-based NVIDIA Jetson AGX Orin with 64 GB of unified memory). I cannot run some of the systems due to a PyTorch multiprocessing error. Here’s the error message I received when testing DROID-Splat:

[Main]: Load pretrained checkpoint from ./pretrained/droid.pth!
/home/DROID-Splat/src/slam.py:260: FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  state_dict = OrderedDict([(k.replace("module.", ""), v) for (k, v) in torch.load(pretrained).items()])
[2024-12-09 08:08:06,685][main][INFO] - Running on 2000 frames
/home/anaconda3/envs/DROID-Splat/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension:
warn(f"Failed to load image Python extension: {e}")
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/anaconda3/envs/DROID-Splat/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/home/anaconda3/envs/DROID-Splat/lib/python3.10/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
  File "/home/anaconda3/envs/DROID-Splat/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 149, in rebuild_cuda_tensor
    storage = storage_cls._new_shared_cuda(
  File "/home/anaconda3/envs/DROID-Splat/lib/python3.10/site-packages/torch/storage.py", line 1420, in _new_shared_cuda
    return torch.UntypedStorage._new_shared_cuda(*args, **kwargs)
RuntimeError: CUDA error: invalid argument
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I also just tested DROID-Splat on a different x86-based system with an RTX 4080 GPU (CUDA 12.4), and everything works fine there apart from running into the GPU memory limits. The Jetson AGX Orin, with its 64 GB of unified memory, should have more than enough memory, however. Do you have an idea what might cause the problem?

On the Jetson, PyTorch 2.5.0a0+872d972e41.nv24.08 is installed, which is the only available build that works with CUDA 12.6.
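For context, the crash happens when a CUDA tensor is pickled into a spawned worker process (that is what torch/multiprocessing/reductions.py in the traceback does). Below is a minimal sketch of that sharing pattern, not the actual DROID-Splat code, just the IPC path that fails on the Jetson:

import torch
import torch.multiprocessing as mp

def consumer(queue):
    # Unpickling the CUDA tensor in the child calls rebuild_cuda_tensor ->
    # UntypedStorage._new_shared_cuda, which is where the error is raised.
    t = queue.get()
    print(t.sum())

if __name__ == "__main__":
    mp.set_start_method("spawn")
    queue = mp.Queue()
    t = torch.ones(4, device="cuda")
    p = mp.Process(target=consumer, args=(queue,))
    p.start()
    queue.put(t)  # shares the CUDA storage with the child process
    p.join()

On the x86/RTX 4080 machine this kind of sharing works fine, which is why I suspect something specific to the Jetson setup.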

Hi! I have the same problem using torch multiprocessing with the spawn method. I pass torch models (YOLO models) as arguments and have more than enough GPU memory. I also tested it on an RTX 4080 and it worked, but that machine is x86, not ARM like the Jetsons (I don't know if that is the problem).
My tests were run in Docker, and my Jetson Orin NX runs JetPack 6.

Did you manage to fix the issue? If not, does someone have an idea?

Hi @NailikLN! I haven't solved the issue yet. I'm testing other SLAM systems as well, and so far I've encountered a similar issue with DROID-SLAM and exactly the same issue with NeRF-LOAM. Do you have JetPack 6.1 or JetPack 6.0 installed, and CUDA 12.6.10 or CUDA 12.2.1? If I don't find another solution, I will try downgrading JetPack to use an earlier CUDA version than 12.6.
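To compare the two environments, this is a quick check I run on each machine (just standard PyTorch queries, nothing specific to the SLAM systems):

import torch

# Print the CUDA version PyTorch was built against and the GPU's
# compute capability, to compare the Jetson with the x86 machine.
print("torch:", torch.__version__)
print("built for CUDA:", torch.version.cuda)
print("device:", torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))
print("arch list:", torch.cuda.get_arch_list())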

I also have JetPack 6.0 and CUDA 12.2; let me know if JetPack 5 resolves the issue! I also tested some other approaches, but none of them worked (different Docker images, etc.), so I don't see what else it could be…


Hi, I'm also having the same problem on an NVIDIA Jetson AGX Orin 64GB while using PyTorch 2.5.0a0+872d972e41.nv24.08 and JetPack 6.0 (but inside a JetPack 6.1-based Docker container). My workload is an LLM/VLM application, but the error is pretty much the same:

Process EmbeddingProcess-2:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/nvidia/myapp/process_base.py", line 188, in run
    item = self._queue.get()
  File "/usr/lib/python3.10/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/reductions.py", line 149, in rebuild_cuda_tensor
    storage = storage_cls._new_shared_cuda(
  File "/usr/local/lib/python3.10/dist-packages/torch/storage.py", line 1420, in _new_shared_cuda
    return torch.UntypedStorage._new_shared_cuda(*args, **kwargs)
RuntimeError: CUDA error: invalid argument
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I finally found the solution, and it's very simple. I tried to install HI-SLAM2 on the Jetson AGX Orin with JetPack 6.1 and CUDA 12.6, and when running the demo a similar torch multiprocessing error occurred. The fix was to change the compile args in the project's setup.py (and in the setup.py files of the submodules as well): the Jetson AGX Orin's GPU has compute capability 8.7 (sm_87), which was not included in the hardcoded -gencode list, so the CUDA extensions were built for the wrong architecture.

Here is an example in HI-SLAM2:

import os.path as osp

from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

ROOT = osp.dirname(osp.abspath(__file__))

setup(
    name='droid_backends',
    ext_modules=[
        CUDAExtension('droid_backends',
            include_dirs=[osp.join(ROOT, 'thirdparty/eigen')],
            sources=[
                'src/droid.cpp', 
                'src/droid_kernels.cu',
                'src/correlation_kernels.cu',
                'src/altcorr_kernel.cu',
            ],
            extra_compile_args={
                'cxx': ['-O3'],
                'nvcc': ['-O3',
                    # '-gencode=arch=compute_60,code=sm_60',
                    # '-gencode=arch=compute_61,code=sm_61',
                    # '-gencode=arch=compute_70,code=sm_70',
                    # '-gencode=arch=compute_75,code=sm_75',
                    # '-gencode=arch=compute_80,code=sm_80',
                    # '-gencode=arch=compute_86,code=sm_86',
                    '-gencode=arch=compute_87,code=sm_87', # this is the important change
                ]
            }),
    ],
    cmdclass={ 'build_ext' : BuildExtension }
)
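
If you're unsure which gencode flag your board needs, you can query the compute capability from PyTorch (a quick check, not part of HI-SLAM2); the AGX Orin reports (8, 7), i.e. sm_87:

import torch

# Compute capability of the local GPU; this determines the
# -gencode=arch=compute_XY,code=sm_XY flag to pass to nvcc.
major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor} -> sm_{major}{minor}")

If a project's setup.py does not hardcode -gencode flags at all, torch.utils.cpp_extension usually falls back to the TORCH_CUDA_ARCH_LIST environment variable, so exporting TORCH_CUDA_ARCH_LIST="8.7" before building can achieve the same result.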