Device not recognized by nerfstudio

I am getting this error every time I run the training command. I am using an HPC server with SLURM, and CUDA is loaded with the module load cuda cudnn commands. I am running the script on a GPU compute node, but GPU usage stays at 0%.
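(Before the error output below, it may help to verify that the SLURM allocation actually exposes a GPU inside the job. A minimal sketch, assuming PyTorch is importable; whether your sbatch script requests a GPU, e.g. via --gres=gpu:1, is an assumption to check on your side:

import os
import torch

# If SLURM did not allocate a GPU to this job step, CUDA_VISIBLE_DEVICES
# is typically empty or unset and device_count() reports 0.
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("is_available:", torch.cuda.is_available())
print("device_count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))

)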

Traceback (most recent call last):
  File "/home/ashri/.conda/envs/nerfstudio/lib/python3.8/site-packages/multiprocess/process.py", line 315, in _bootstrap
    self.run()
  File "/home/ashri/nerfstudio/nerfstudio/data/datamanagers/parallel_datamanager.py", line 106, in run
    ray_bundle = ray_bundle.pin_memory()
  File "/home/ashri/nerfstudio/nerfstudio/utils/tensor_dataclass.py", line 273, in pin_memory
    return self._apply_fn_to_fields(lambda x: x.pin_memory())
  File "/home/ashri/nerfstudio/nerfstudio/utils/tensor_dataclass.py", line 303, in _apply_fn_to_fields
    new_fields = self._apply_fn_to_dict(
  File "/home/ashri/nerfstudio/nerfstudio/utils/tensor_dataclass.py", line 344, in _apply_fn_to_dict
    new_dict[f] = fn(v)
  File "/home/ashri/nerfstudio/nerfstudio/utils/tensor_dataclass.py", line 273, in <lambda>
    return self._apply_fn_to_fields(lambda x: x.pin_memory())
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

I have tried building PyTorch from source with TORCH_USE_CUDA_DSA=1, but it doesn't help. Any suggestion to solve this is appreciated.
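(Note that the traceback shows the failure inside a worker process spawned by nerfstudio's parallel datamanager: pin_memory() has to initialize a CUDA context in that child process. On shared HPC nodes the GPUs are often set to the EXCLUSIVE_PROCESS compute mode, which allows only one process to hold a CUDA context per GPU, and the typical symptom of a second process trying is exactly "CUDA-capable device(s) is/are busy or unavailable". A minimal two-process sketch that reproduces the pattern; this is hypothetical illustration assuming PyTorch with CUDA, not nerfstudio code:

import torch
import torch.multiprocessing as mp

def worker():
    # Pinning host memory initializes a CUDA context in this child process.
    # Under EXCLUSIVE_PROCESS compute mode, with the parent already holding
    # the context, this is the call that fails with "busy or unavailable".
    x = torch.randn(4, 3).pin_memory()
    print("child pinned:", x.is_pinned())

if __name__ == "__main__":
    torch.zeros(1, device="cuda")   # parent takes the (only) CUDA context
    mp.set_start_method("spawn")
    p = mp.Process(target=worker)
    p.start()
    p.join()

)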

Could you check whether your current setup can detect the GPU in any other CUDA application (e.g., a test from the CUDA Samples)?

If I run torch.cuda.is_available() or torch.randn(shape, device='cuda') from IPython in the conda environment on the server, it works without any problem.

This would point to an environment issue: check which Python environment is used by default in your terminal and which one is used in IPython.
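(A quick way to confirm both resolve to the same interpreter; a minimal sketch, where the expected path is an assumption based on the conda env shown in the tracebacks:

import sys
import torch

# Expected to point into ~/.conda/envs/nerfstudio in both the shell's
# python and the IPython session.
print(sys.executable)
print(torch.__version__, torch.version.cuda)

)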

I run conda activate nerfstudio before launching the code in an interactive session. Here is the full output:

Traceback (most recent call last):
  File "/home/ashri/.conda/envs/nerfstudio/lib/python3.8/site-packages/multiprocess/process.py", line 315, in _bootstrap
    self.run()
  File "/home/ashri/nerfstudio/nerfstudio/data/datamanagers/parallel_datamanager.py", line 106, in run
    ray_bundle = ray_bundle.pin_memory()
  File "/home/ashri/nerfstudio/nerfstudio/utils/tensor_dataclass.py", line 273, in pin_memory
    return self._apply_fn_to_fields(lambda x: x.pin_memory())
  File "/home/ashri/nerfstudio/nerfstudio/utils/tensor_dataclass.py", line 303, in _apply_fn_to_fields
    new_fields = self._apply_fn_to_dict(
  File "/home/ashri/nerfstudio/nerfstudio/utils/tensor_dataclass.py", line 344, in _apply_fn_to_dict
    new_dict[f] = fn(v)
  File "/home/ashri/nerfstudio/nerfstudio/utils/tensor_dataclass.py", line 273, in <lambda>
    return self._apply_fn_to_fields(lambda x: x.pin_memory())
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

╭─────────────── viser ───────────────╮
│             ╷                       │
│   HTTP      │ http://0.0.0.0:7007   │
│   Websocket │ ws://0.0.0.0:7007     │
│             ╵                       │
╰─────────────────────────────────────╯
[NOTE] Not running eval iterations since only viewer is enabled.
Use --vis {wandb, tensorboard, viewer+wandb, viewer+tensorboard} to run with eval.
No Nerfstudio checkpoint to load, so training from scratch.
Disabled comet/tensorboard/wandb event writers
^CTraceback (most recent call last):
  File "/home/ashri/nerfstudio/nerfstudio/scripts/train.py", line 189, in launch
    main_func(local_rank=0, world_size=world_size, config=config)
  File "/home/ashri/nerfstudio/nerfstudio/scripts/train.py", line 100, in train_loop
    trainer.train()
  File "/home/ashri/nerfstudio/nerfstudio/engine/trainer.py", line 261, in train
    loss, loss_dict, metrics_dict = self.train_iteration(step)
  File "/home/ashri/nerfstudio/nerfstudio/utils/profiler.py", line 112, in inner
    out = func(*args, **kwargs)
  File "/home/ashri/nerfstudio/nerfstudio/engine/trainer.py", line 496, in train_iteration
    _, loss_dict, metrics_dict = self.pipeline.get_train_loss_dict(step=step)
  File "/home/ashri/nerfstudio/nerfstudio/utils/profiler.py", line 112, in inner
    out = func(*args, **kwargs)
  File "/home/ashri/nerfstudio/nerfstudio/pipelines/base_pipeline.py", line 300, in get_train_loss_dict
    ray_bundle, batch = self.datamanager.next_train(step)
  File "/home/ashri/nerfstudio/nerfstudio/data/datamanagers/parallel_datamanager.py", line 290, in next_train
    bundle, batch = self.data_queue.get()
  File "/home/ashri/.conda/envs/nerfstudio/lib/python3.8/site-packages/multiprocess/queues.py", line 100, in get
    res = self._recv_bytes()
  File "/home/ashri/.conda/envs/nerfstudio/lib/python3.8/site-packages/multiprocess/connection.py", line 219, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/ashri/.conda/envs/nerfstudio/lib/python3.8/site-packages/multiprocess/connection.py", line 417, in _recv_bytes
    buf = self._recv(4)
  File "/home/ashri/.conda/envs/nerfstudio/lib/python3.8/site-packages/multiprocess/connection.py", line 382, in _recv
    chunk = read(handle, remaining)
KeyboardInterrupt

Printing profiling stats, from longest to shortest duration in seconds
Trainer.train_iteration: 55.9284
VanillaPipeline.get_train_loss_dict: 55.9271

IPython session:

(nerfstudio) [ashri@gl1503 nerfstudio]$ ipython
Python 3.8.19 (default, Mar 20 2024, 19:58:24)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.12.3 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import torch

In [2]: tensor = torch.randn((4, 3), device='cuda')

In [3]: print(tensor)
tensor([[-0.0186,  0.0662,  0.0233],
        [ 0.8250, -1.3138,  0.0955],
        [ 2.4501, -1.4790, -0.2986],
        [-0.2575,  1.6024, -1.0451]], device='cuda:0')
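(The transcript above shows that a single process can use the GPU, while training fails only when the datamanager's worker process tries to open a second CUDA context. That asymmetry is consistent with the GPU compute mode being set to Exclusive_Process, so it is worth querying the mode on the node; a minimal sketch using nvidia-smi's standard query interface:

import subprocess

# "Default" allows multiple processes per GPU; "Exclusive_Process" allows
# only one CUDA context at a time, which would break nerfstudio's
# parallel datamanager worker.
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=compute_mode", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(out.stdout.strip())

If it reports Exclusive_Process, options include asking the cluster admins to switch the GPU to Default mode or to enable NVIDIA MPS, which lets multiple processes share a single CUDA context.)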