PROBLEM: "Distributed package doesn't have NCCL built in" error when running Llama 2 with torchrun on an Apple M2 Mac.
I understand that Apple Silicon (M2) does not support CUDA or NCCL, but how can I fix this error?
(New_Torch) akram_personal@AKRAMs-MacBook-Pro llama % python3 test_torch.py
tensor([[0.5552, 0.4753, 0.6758],
[0.3080, 0.7625, 0.7667],
[0.5621, 0.6176, 0.2445],
[0.6803, 0.3974, 0.7331],
[0.3485, 0.3801, 0.9699]])
False
(New_Torch) akram_personal@AKRAMs-MacBook-Pro llama %
Test Script:
import torch
x = torch.rand(5,3)
print(x)
print(torch.cuda.is_available())
#print(torch.cuda.nccl.is_available(tensors=x))
#print(torch.cuda.nccl.is_available(torch.randn(1).cuda()))
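The test script above only checks CUDA. A slightly fuller check, sketched here, also reports MPS and the distributed backends that this PyTorch build was compiled with. `backend_report` is a hypothetical helper name, not part of PyTorch or the Llama code; the calls inside it (`torch.backends.mps.is_available`, `torch.distributed.is_nccl_available`, `torch.distributed.is_gloo_available`) are standard PyTorch functions.

```python
import torch
import torch.distributed as dist


def backend_report():
    """Collect which accelerators and distributed backends this build exposes.

    Hypothetical helper for illustration only.
    """
    report = {
        "cuda": torch.cuda.is_available(),
        "mps": torch.backends.mps.is_available(),
        "distributed": dist.is_available(),
    }
    # The backend-specific checks are only meaningful when the
    # distributed package itself was compiled into this build.
    if report["distributed"]:
        report["nccl"] = dist.is_nccl_available()
        report["gloo"] = dist.is_gloo_available()
    return report


if __name__ == "__main__":
    for name, available in backend_report().items():
        print(f"{name}: {available}")
```

Given the warning in the traceback ("NCCL support is not compiled"), the `nccl` entry would presumably be reported as unavailable on this machine.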
(New_Torch) akram_personal@AKRAMs-MacBook-Pro llama % torchrun --nproc_per_node 1 example_text_completion.py \
--ckpt_dir llama-2-7b/ \
--tokenizer_path tokenizer.model \
--max_seq_len 128 --max_batch_size 6
[2024-03-05 23:30:17,309] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
/opt/anaconda3/envs/New_Torch/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py:608: UserWarning: Attempted to get default timeout for nccl backend, but NCCL support is not compiled
warnings.warn("Attempted to get default timeout for nccl backend, but NCCL support is not compiled")
Traceback (most recent call last):
File "/Users/akram_personal/AKRAM_CODE_FOLDER/AKRAM_LLM/LLAMA_MODELS/Meta_Proj/llama/example_text_completion.py", line 69, in <module>
fire.Fire(main)
File "/opt/anaconda3/envs/New_Torch/lib/python3.11/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/New_Torch/lib/python3.11/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/New_Torch/lib/python3.11/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^
File "/Users/akram_personal/AKRAM_CODE_FOLDER/AKRAM_LLM/LLAMA_MODELS/Meta_Proj/llama/example_text_completion.py", line 32, in main
generator = Llama.build(
^^^^^^^^^^^^
File "/Users/akram_personal/AKRAM_CODE_FOLDER/AKRAM_LLM/LLAMA_MODELS/Meta_Proj/llama/llama/generation.py", line 85, in build
torch.distributed.init_process_group("nccl")
File "/opt/anaconda3/envs/New_Torch/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 86, in wrapper
func_return = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/New_Torch/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1184, in init_process_group
default_pg, _ = _new_process_group_helper(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/New_Torch/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1302, in _new_process_group_helper
raise RuntimeError("Distributed package doesn't have NCCL built in")
RuntimeError: Distributed package doesn't have NCCL built in
[2024-03-05 23:30:22,330] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 48795) of binary: /opt/anaconda3/envs/New_Torch/bin/python
Traceback (most recent call last):
File "/opt/anaconda3/envs/New_Torch/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.2.1', 'console_scripts', 'torchrun')())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/New_Torch/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/New_Torch/lib/python3.11/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/opt/anaconda3/envs/New_Torch/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/opt/anaconda3/envs/New_Torch/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/envs/New_Torch/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
example_text_completion.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2024-03-05_23:30:22
host : 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 48795)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
(New_Torch) akram_personal@AKRAMs-MacBook-Pro llama %
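The traceback shows that llama/generation.py, line 85, hard-codes `torch.distributed.init_process_group("nccl")`. One possible workaround, sketched here under assumptions, is to fall back to the Gloo backend (which runs on CPU) whenever NCCL is unavailable, and to choose a non-CUDA device. `pick_backend` and `pick_device` are hypothetical helper names, not part of PyTorch or the Llama repository, and this sketch has not been verified against the full Llama example.

```python
import torch
import torch.distributed as dist


def pick_backend() -> str:
    """Return "nccl" only when CUDA and NCCL are both usable, else "gloo".

    Hypothetical helper; assumes the Gloo backend is compiled into this build.
    """
    if torch.cuda.is_available() and dist.is_nccl_available():
        return "nccl"
    return "gloo"


def pick_device() -> torch.device:
    """Prefer CUDA, then Apple's MPS backend, then plain CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")


if __name__ == "__main__":
    print("backend:", pick_backend())
    print("device:", pick_device())
```

With helpers like these, the hard-coded call could become `torch.distributed.init_process_group(pick_backend())`. Any other CUDA-specific calls in generation.py would likely need similar changes before the example runs on a Mac, so treat this as a starting point rather than a complete fix.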