ProcessGroupNCCL cannot find GPUs

Erica_Zheng · April 5, 2023, 8:45pm

Dear All,

I ran into the following error:
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
But when I check PyTorch, it shows that it can find my two GPUs. Please advise. Thank you.

>>> import torch
>>> print(torch.__version__)
2.0.0+cu117
>>> print(torch.cuda.device_count())
2

ptrblck · April 5, 2023, 9:52pm

Are you setting CUDA_VISIBLE_DEVICES to an invalid value, which would mask the available devices?
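
A minimal sketch of how one might check this, assuming the check runs in the same shell or launcher environment that starts the training script. The helper name `describe_visible_devices` is hypothetical and only interprets the environment variable; it does not query the driver.

```python
import os


def describe_visible_devices(env=None):
    """Summarize how CUDA_VISIBLE_DEVICES would affect device visibility.

    `env` defaults to the current process environment; passing a dict makes
    the helper easy to try with different values.
    """
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return "CUDA_VISIBLE_DEVICES is unset: no masking is applied"
    if value.strip() == "":
        return "CUDA_VISIBLE_DEVICES is empty: all devices are hidden"
    return f"CUDA_VISIBLE_DEVICES={value!r}: only valid listed devices are visible"


# Report the setting for the current process.
print(describe_visible_devices())
```

Comparing this output with `torch.cuda.device_count()` in the same process that calls `init_process_group` shows whether the launcher sees a different environment than the interactive session.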

hes95075 · June 8, 2024, 2:11am

Hello, I have already set it:
>>> torch.cuda.get_device_name(0)
'NVIDIA GeForce GTX 1650'
but it is still wrong:
[2024-06-08 09:51:42.677887] [start training example_text_completion.py(16852455) on jd-i0302-dl for huangyongfeng]
[2024-06-08 09:51:42.789996] [pre-training check] check [jd-i0302-dl] cpu memory 21.0 < 256G and total gpu memory 0 < 100M passed
[2024-06-08 09:51:42.900271] found [new_lla3] from [huangyongfeng] in [/hf_shared/hfai_envs/huangyongfeng/new_lla3_0], start loading…
[2024-06-08 09:51:44.338191] user haienv [new_lla3] loaded
[2024-06-08 09:51:50.584100] [W CUDAFunctions.cpp:108] Warning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11040). Please update your GPU driver by downloading and installing a new version from the URL: Official Drivers | NVIDIA Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (function operator())
[2024-06-08 09:51:50.585106] Traceback (most recent call last):
[2024-06-08 09:51:50.585128] File "example_text_completion.py", line 64, in <module>
[2024-06-08 09:51:50.585214] fire.Fire(main)
[2024-06-08 09:51:50.585228] File "/hf_shared/hfai_envs/huangyongfeng/new_lla3_0/lib/python3.8/site-packages/fire/core.py", line 143, in Fire
[2024-06-08 09:51:50.586689] component_trace = _Fire(component, args, parsed_flag_args, context, name)
[2024-06-08 09:51:50.586697] File "/hf_shared/hfai_envs/huangyongfeng/new_lla3_0/lib/python3.8/site-packages/fire/core.py", line 477, in _Fire
[2024-06-08 09:51:50.587939] component, remaining_args = _CallAndUpdateTrace(
[2024-06-08 09:51:50.587950] File "/hf_shared/hfai_envs/huangyongfeng/new_lla3_0/lib/python3.8/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[2024-06-08 09:51:50.588071] component = fn(*varargs, **kwargs)
[2024-06-08 09:51:50.588078] File "example_text_completion.py", line 27, in main
[2024-06-08 09:51:50.588135] generator = Llama.build(
[2024-06-08 09:51:50.588147] File "/home/hsr/hfai/llama3-main/llama3-main/llama/generation.py", line 68, in build
[2024-06-08 09:51:50.589332] torch.distributed.init_process_group("nccl")
[2024-06-08 09:51:50.589339] File "/hf_shared/hfai_envs/huangyongfeng/new_lla3_0/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[2024-06-08 09:51:50.590771] return func(*args, **kwargs)
[2024-06-08 09:51:50.590778] File "/hf_shared/hfai_envs/huangyongfeng/new_lla3_0/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 89, in wrapper
[2024-06-08 09:51:50.590829] func_return = func(*args, **kwargs)
[2024-06-08 09:51:50.590835] File "/hf_shared/hfai_envs/huangyongfeng/new_lla3_0/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1312, in init_process_group
[2024-06-08 09:51:50.592212] default_pg, _ = _new_process_group_helper(
[2024-06-08 09:51:50.592219] File "/hf_shared/hfai_envs/huangyongfeng/new_lla3_0/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1533, in _new_process_group_helper
[2024-06-08 09:51:50.592433] backend_class = ProcessGroupNCCL(
[2024-06-08 09:51:50.592439] ValueError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
Is the problem caused by "The NVIDIA driver on your system is too old"? However, I run this program in WSL, and when I update my NVIDIA driver to the newest version, it does not seem to work: I get a "segmentation fault". How can I address it?

ptrblck · June 8, 2024, 2:13am

You would need to install a newer driver or install a PyTorch binary with a compatible CUDA version (e.g. with CUDA 11.8 in case you are using a driver shipped with CUDA 11).
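
To see which CUDA version a given PyTorch binary targets, one option is to read the local version tag, as in the `2.0.0+cu117` string earlier in this thread. The sketch below assumes the `+cuXYZ` tag convention used by the wheels from the PyTorch download index; the helper name `cuda_from_torch_version` is hypothetical, and wheels without such a tag return `None`.

```python
import re


def cuda_from_torch_version(version):
    """Extract (major, minor) CUDA version from a tag like '2.0.0+cu117'.

    Returns None when the version string carries no '+cuXYZ' suffix
    (for example '+cpu' builds or untagged wheels).
    """
    match = re.search(r"\+cu(\d+)$", version)
    if match is None:
        return None
    digits = match.group(1)
    if len(digits) < 2:
        return None
    # Assumed convention: the last digit is the minor version, the rest is the major.
    return int(digits[:-1]), int(digits[-1])


print(cuda_from_torch_version("2.0.0+cu117"))
```

When `torch` can be imported, `torch.version.cuda` reports the same information directly and does not depend on the tag.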

hes95075 · June 8, 2024, 11:10pm

Thank you, but I have already reinstalled CUDA and PyTorch. Both are version 12.1, and my NVIDIA driver is 12.4. However, I'm still encountering the same problem:
[W CUDAFunctions.cpp:108] Warning: CUDA initialization: The NVIDIA driver on your system is too old (found version 11040). Please update your GPU driver by downloading and installing a new version from the URL: Official Drivers | NVIDIA Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (function operator())

ptrblck · June 9, 2024, 2:06am

The error message points to a driver shipped with CUDA 11.4, so make sure a single driver is installed, update it to a CUDA 12 driver, or install the PyTorch binaries with CUDA 11.8.
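
The "found version 11040" in the warning uses CUDA's integer encoding (1000 × major + 10 × minor), which is how it maps to CUDA 11.4. A small sketch of that decoding follows, with a rough major-version comparison that mirrors the advice above. Both function names are hypothetical, and the comparison is a simplification of the real compatibility rules, not a full check.

```python
def decode_cuda_version(encoded):
    """Decode an integer such as 11040 into (major, minor)."""
    major = encoded // 1000
    minor = (encoded % 1000) // 10
    return major, minor


def driver_major_covers(driver_encoded, binary_cuda):
    """Rough heuristic: is the driver's CUDA major version at least the binary's?

    `binary_cuda` is a string such as '12.1' or '11.8'.
    """
    driver_major, _ = decode_cuda_version(driver_encoded)
    binary_major = int(binary_cuda.split(".")[0])
    return driver_major >= binary_major


print(decode_cuda_version(11040))           # (11, 4)
print(driver_major_covers(11040, "12.1"))   # False
print(driver_major_covers(11040, "11.8"))   # True
```

If the decoded version does not match the driver that was just installed, that suggests the process is loading a different, older driver library than the one expected.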

hes95075 · June 10, 2024, 8:32am

Thank you for your help. I have already resolved this problem.