torch.distributed.DistBackendError

Hello! I get the error below when I run the training code from GitHub - tatsu-lab/stanford_alpaca (code and documentation to train Stanford's Alpaca models and generate the data). Any suggestions? Thanks a lot!

2023-12-10 22:16:51.931 n193-016-214:208729:208819 [1] enqueue.cc:130 NCCL WARN Cuda failure 'named symbol not found'

2023-12-10 22:16:51.931 n193-016-214:208729:208819 [1] enqueue.cc:115 NCCL WARN Cuda failure 'named symbol not found'

2023-12-10 22:16:51.931 n193-016-214:208729:208819 [1] NCCL INFO init.cc:1285 -> 1
2023-12-10 22:16:51.931 n193-016-214:208729:208819 [1] NCCL INFO group.cc:64 -> 1 [Async thread]
2023-12-10 22:16:51.932 n193-016-214:208729:208729 [1] NCCL INFO group.cc:422 -> 1
2023-12-10 22:16:51.932 n193-016-214:208729:208729 [1] NCCL INFO group.cc:106 -> 1
Traceback (most recent call last):
  File "/opt/tiger/stanford_alpaca/train.py", line 222, in <module>
    train()
  File "/opt/tiger/stanford_alpaca/train.py", line 186, in train
    model = transformers.AutoModelForCausalLM.from_pretrained(
  File "/home/tiger/.local/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 471, in from_pretrained
    return model_class.from_pretrained(
  File "/home/tiger/.local/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2498, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 459, in wrapper
    f(module, *args, **kwargs)
  File "/home/tiger/.local/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 659, in __init__
    self.model = LlamaModel(config)
  File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 459, in wrapper
    f(module, *args, **kwargs)
  File "/home/tiger/.local/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 463, in __init__
    self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
  File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 466, in wrapper
    self._post_init_method(module)
  File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1000, in _post_init_method
    self._zero_init_param(param)
  File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 956, in _zero_init_param
    dist.broadcast(param, 0, self.get_dp_process_group())
  File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 117, in log_wrapper
    return func(*args, **kwargs)
  File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/comm/comm.py", line 224, in broadcast
    return cdb.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/home/tiger/.local/lib/python3.9/site-packages/deepspeed/comm/torch.py", line 196, in broadcast
    return torch.distributed.broadcast(tensor=tensor, src=src, group=group, async_op=async_op)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/distributed_c10d.py", line 1906, in broadcast
    work = default_pg.broadcast([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1367, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'named symbol not found'
[2023-12-10 22:16:54,324] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 208728) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train.py FAILED

Failures:
[1]:
time : 2023-12-10_22:16:54
host : n193-016-214.byted.org
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 208729)
error_file: <N/A>
traceback : To enable traceback see: Error Propagation — PyTorch 2.1 documentation

Root Cause (first observed failure):
[0]:
time : 2023-12-10_22:16:54
host : n193-016-214.byted.org
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 208728)
error_file: <N/A>
traceback : To enable traceback see: Error Propagation — PyTorch 2.1 documentation

Sun Dec 10 22:19:05 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.191.01   Driver Version: 450.191.01    CUDA Version: 12.2    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-SXM-80GB       On   | 00000000:4A:00.0 Off |                    0 |
| N/A   35C    P0    93W / 400W |      0MiB / 81252MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  A100-SXM-80GB       On   | 00000000:4E:00.0 Off |                    0 |
| N/A   36C    P0    92W / 400W |      0MiB / 81252MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Your 450-series NVIDIA driver is too old for this build unless you have the CUDA forward-compatibility package set up. The "Cuda failure 'named symbol not found'" warning usually means the CUDA 12.x runtime that PyTorch/NCCL was built against (note "NCCL version 2.18.1" above) is calling a driver symbol that driver 450.191.01 does not export; the 450 branch natively supports only up to CUDA 11.x. Either upgrade the driver (525.60.13 or newer for CUDA 12), install the forward-compatibility package, or use a PyTorch build matching a CUDA version your driver supports.
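If it helps, here is a small self-contained sketch of that version check. The minimum-driver numbers are my reading of NVIDIA's CUDA compatibility table for Linux (CUDA 11.x minor-version compatibility needs >= 450.80.02, CUDA 12.x needs >= 525.60.13); please verify them against the current table for your exact toolkit release:

```python
# Sketch: check whether an installed NVIDIA driver meets the minimum
# required by a given CUDA major version. The MIN_DRIVER values are
# assumptions taken from NVIDIA's CUDA compatibility table (Linux).

MIN_DRIVER = {
    11: (450, 80, 2),   # CUDA 11.x minor-version compatibility: >= 450.80.02
    12: (525, 60, 13),  # CUDA 12.x: >= 525.60.13
}

def parse_version(v: str) -> tuple:
    """Turn a driver string like '450.191.01' into (450, 191, 1)."""
    return tuple(int(part) for part in v.split("."))

def driver_ok(driver: str, cuda_major: int) -> bool:
    """True if the driver meets the minimum for that CUDA major version."""
    needed = MIN_DRIVER.get(cuda_major)
    if needed is None:
        raise ValueError(f"no minimum recorded for CUDA {cuda_major}.x")
    return parse_version(driver) >= needed

# The driver from the nvidia-smi output above, against a CUDA 12 build:
print(driver_ok("450.191.01", 12))  # False: driver predates CUDA 12 support
print(driver_ok("450.191.01", 11))  # True: fine for a CUDA 11.x build
```

Running it against your reported driver shows exactly the mismatch: fine for CUDA 11.x builds, too old for the CUDA 12.x build the error came from.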