Device not recognized by nerfstudio

I am getting this error every time I run the training command. I am using an HPC server with SLURM, and CUDA is loaded with the module load cuda cudnn commands. I am running the script on a GPU compute node, but GPU usage stays at 0%.
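(Before the error output below, it may help to verify that the SLURM allocation actually exposes a GPU inside the job. A minimal sketch, assuming PyTorch is importable; whether your sbatch script requests a GPU, e.g. via --gres=gpu:1, is an assumption to check on your side:

import os
import torch

# If SLURM did not allocate a GPU to this job step, CUDA_VISIBLE_DEVICES
# is typically empty or unset and device_count() reports 0.
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("is_available:", torch.cuda.is_available())
print("device_count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))

)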

Traceback (most recent call last):
  File "/home/ashri/.conda/envs/nerfstudio/lib/python3.8/site-packages/multiprocess/process.py", line 315, in _bootstrap
    self.run()
  File "/home/ashri/nerfstudio/nerfstudio/data/datamanagers/parallel_datamanager.py", line 106, in run
    ray_bundle = ray_bundle.pin_memory()
  File "/home/ashri/nerfstudio/nerfstudio/utils/tensor_dataclass.py", line 273, in pin_memory
    return self._apply_fn_to_fields(lambda x: x.pin_memory())
  File "/home/ashri/nerfstudio/nerfstudio/utils/tensor_dataclass.py", line 303, in _apply_fn_to_fields
    new_fields = self._apply_fn_to_dict(
  File "/home/ashri/nerfstudio/nerfstudio/utils/tensor_dataclass.py", line 344, in _apply_fn_to_dict
    new_dict[f] = fn(v)
  File "/home/ashri/nerfstudio/nerfstudio/utils/tensor_dataclass.py", line 273, in <lambda>
    return self._apply_fn_to_fields(lambda x: x.pin_memory())
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

I have tried building PyTorch from source with TORCH_USE_CUDA_DSA=1, but it doesn't help. Any suggestion to solve this is appreciated.
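(Note that the traceback shows the failure inside a worker process spawned by nerfstudio's parallel datamanager: pin_memory() has to initialize a CUDA context in that child process. On shared HPC nodes the GPUs are often set to the EXCLUSIVE_PROCESS compute mode, which allows only one process to hold a CUDA context per GPU, and the typical symptom of a second process trying is exactly "CUDA-capable device(s) is/are busy or unavailable". A minimal two-process sketch that reproduces the pattern; this is hypothetical illustration assuming PyTorch with CUDA, not nerfstudio code:

import torch
import torch.multiprocessing as mp

def worker():
    # Pinning host memory initializes a CUDA context in this child process.
    # Under EXCLUSIVE_PROCESS compute mode, with the parent already holding
    # the context, this is the call that fails with "busy or unavailable".
    x = torch.randn(4, 3).pin_memory()
    print("child pinned:", x.is_pinned())

if __name__ == "__main__":
    torch.zeros(1, device="cuda")   # parent takes the (only) CUDA context
    mp.set_start_method("spawn")
    p = mp.Process(target=worker)
    p.start()
    p.join()

)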

Could you check whether your current setup can detect the GPU in any other CUDA application (e.g., a test from the CUDA Samples)?

If I run torch.cuda.is_available() or torch.randn(shape, device='cuda') from IPython in the conda environment on the server, it works without any problem.

This would point to an environment issue: check which Python environment is used by default in your terminal and which one is used in IPython.
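(A quick way to confirm both resolve to the same interpreter; a minimal sketch, where the expected path is an assumption based on the conda env shown in the tracebacks:

import sys
import torch

# Expected to point into ~/.conda/envs/nerfstudio in both the shell's
# python and the IPython session.
print(sys.executable)
print(torch.__version__, torch.version.cuda)

)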

I run conda activate nerfstudio before launching the code in an interactive session. Here is the full output:

Traceback (most recent call last):
  File "/home/ashri/.conda/envs/nerfstudio/lib/python3.8/site-packages/multiprocess/process.py", line 315, in _bootstrap
    self.run()
  File "/home/ashri/nerfstudio/nerfstudio/data/datamanagers/parallel_datamanager.py", line 106, in run
    ray_bundle = ray_bundle.pin_memory()
  File "/home/ashri/nerfstudio/nerfstudio/utils/tensor_dataclass.py", line 273, in pin_memory
    return self._apply_fn_to_fields(lambda x: x.pin_memory())
  File "/home/ashri/nerfstudio/nerfstudio/utils/tensor_dataclass.py", line 303, in _apply_fn_to_fields
    new_fields = self._apply_fn_to_dict(
  File "/home/ashri/nerfstudio/nerfstudio/utils/tensor_dataclass.py", line 344, in _apply_fn_to_dict
    new_dict[f] = fn(v)
  File "/home/ashri/nerfstudio/nerfstudio/utils/tensor_dataclass.py", line 273, in <lambda>
    return self._apply_fn_to_fields(lambda x: x.pin_memory())
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

╭─────────────── viser ───────────────╮
│             ╷                       │
│   HTTP      │ http://0.0.0.0:7007   │
│   Websocket │ ws://0.0.0.0:7007     │
│             ╵                       │
╰─────────────────────────────────────╯
[NOTE] Not running eval iterations since only viewer is enabled.
Use --vis {wandb, tensorboard, viewer+wandb, viewer+tensorboard} to run with eval.
No Nerfstudio checkpoint to load, so training from scratch.
Disabled comet/tensorboard/wandb event writers
^CTraceback (most recent call last):
  File "/home/ashri/nerfstudio/nerfstudio/scripts/train.py", line 189, in launch
    main_func(local_rank=0, world_size=world_size, config=config)
  File "/home/ashri/nerfstudio/nerfstudio/scripts/train.py", line 100, in train_loop
    trainer.train()
  File "/home/ashri/nerfstudio/nerfstudio/engine/trainer.py", line 261, in train
    loss, loss_dict, metrics_dict = self.train_iteration(step)
  File "/home/ashri/nerfstudio/nerfstudio/utils/profiler.py", line 112, in inner
    out = func(*args, **kwargs)
  File "/home/ashri/nerfstudio/nerfstudio/engine/trainer.py", line 496, in train_iteration
    _, loss_dict, metrics_dict = self.pipeline.get_train_loss_dict(step=step)
  File "/home/ashri/nerfstudio/nerfstudio/utils/profiler.py", line 112, in inner
    out = func(*args, **kwargs)
  File "/home/ashri/nerfstudio/nerfstudio/pipelines/base_pipeline.py", line 300, in get_train_loss_dict
    ray_bundle, batch = self.datamanager.next_train(step)
  File "/home/ashri/nerfstudio/nerfstudio/data/datamanagers/parallel_datamanager.py", line 290, in next_train
    bundle, batch = self.data_queue.get()
  File "/home/ashri/.conda/envs/nerfstudio/lib/python3.8/site-packages/multiprocess/queues.py", line 100, in get
    res = self._recv_bytes()
  File "/home/ashri/.conda/envs/nerfstudio/lib/python3.8/site-packages/multiprocess/connection.py", line 219, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/ashri/.conda/envs/nerfstudio/lib/python3.8/site-packages/multiprocess/connection.py", line 417, in _recv_bytes
    buf = self._recv(4)
  File "/home/ashri/.conda/envs/nerfstudio/lib/python3.8/site-packages/multiprocess/connection.py", line 382, in _recv
    chunk = read(handle, remaining)
KeyboardInterrupt

Printing profiling stats, from longest to shortest duration in seconds
Trainer.train_iteration: 55.9284
VanillaPipeline.get_train_loss_dict: 55.9271

IPython session:

(nerfstudio) [ashri@gl1503 nerfstudio]$ ipython
Python 3.8.19 (default, Mar 20 2024, 19:58:24)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.12.3 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import torch

In [2]: tensor = torch.randn((4, 3), device='cuda')

In [3]: print(tensor)
tensor([[-0.0186,  0.0662,  0.0233],
        [ 0.8250, -1.3138,  0.0955],
        [ 2.4501, -1.4790, -0.2986],
        [-0.2575,  1.6024, -1.0451]], device='cuda:0')
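(The transcript above shows that a single process can use the GPU, while training fails only when the datamanager's worker process tries to open a second CUDA context. That asymmetry is consistent with the GPU compute mode being set to Exclusive_Process, so it is worth querying the mode on the node; a minimal sketch using nvidia-smi's standard query interface:

import subprocess

# "Default" allows multiple processes per GPU; "Exclusive_Process" allows
# only one CUDA context at a time, which would break nerfstudio's
# parallel datamanager worker.
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=compute_mode", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(out.stdout.strip())

If it reports Exclusive_Process, options include asking the cluster admins to switch the GPU to Default mode or to enable NVIDIA MPS, which lets multiple processes share a single CUDA context.)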