Apple Silicon & torchrun: Distributed package doesn't have NCCL built in

Hi, I’m trying to get torchrun to work on my M1 Pro Mac. I saw the other forum posts on this topic, but development moves quickly and none of the suggestions there worked for me. So I downloaded Llama 3 and installed it with pip install -e . from the repo root. I’m running torchrun with the following command:

torchrun --nproc_per_node 1 example_chat_completion.py \
    --no_cuda=True --ckpt_dir Meta-Llama-3-8B-Instruct/ \
    --tokenizer_path Meta-Llama-3-8B-Instruct/tokenizer.model \
    --max_seq_len 512 --max_batch_size 6

Then I get the following error:

Traceback (most recent call last):
  File "/Users/vortec/workspace/llm/llama3/example_chat_completion.py", line 84, in <module>
    fire.Fire(main)
  File "/Users/vortec/workspace/instances/llama3/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/vortec/workspace/instances/llama3/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/Users/vortec/workspace/instances/llama3/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/vortec/workspace/llm/llama3/example_chat_completion.py", line 31, in main
    generator = Llama.build(
                ^^^^^^^^^^^^
  File "/Users/vortec/workspace/llm/llama3/llama/generation.py", line 68, in build
    torch.distributed.init_process_group("nccl")
  File "/Users/vortec/workspace/instances/llama3/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 86, in wrapper
    func_return = func(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/vortec/workspace/instances/llama3/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1184, in init_process_group
    default_pg, _ = _new_process_group_helper(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/vortec/workspace/instances/llama3/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1302, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL built in")
RuntimeError: Distributed package doesn't have NCCL built in
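
In case it helps narrow things down, here is a quick way to check which distributed backends a PyTorch build actually ships with; on my machine NCCL comes back False and gloo True:

import torch
import torch.distributed as dist

# macOS wheels are built without NCCL (it requires NVIDIA GPUs),
# but the CPU-based gloo backend is included.
print(dist.is_nccl_available())           # False on this build
print(dist.is_gloo_available())           # True
print(torch.backends.mps.is_available())  # True on Apple Silicon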

Here’s the output of python -m torch.utils.collect_env:

Collecting environment information...
PyTorch version: 2.2.2
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 14.2.1 (arm64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.1.0.2.5)
CMake version: Could not collect
Libc version: N/A

Python version: 3.11.4 (main, Jun 20 2023, 17:23:00) [Clang 14.0.3 (clang-1403.0.22.14.1)] (64-bit runtime)
Python platform: macOS-14.2.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M1 Pro

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.2.2
[pip3] torchaudio==2.2.2
[pip3] torchvision==0.17.2
[conda] Could not collect

What can I do to fix it?

Don’t use NCCL for distributed operations; it isn’t supported on macOS because it requires NVIDIA GPUs. Also, your M1 Pro has only a single GPU (one MPS device), so there is nothing to distribute across anyway: either initialize the process group with the gloo backend or bypass the distributed setup entirely. See the sketch below.
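
Concretely, the failing call is torch.distributed.init_process_group("nccl") in Llama.build (llama/generation.py). Here is a minimal sketch of the change, untested on my side: gloo does the process-group bookkeeping on the CPU, and you can still place the model on the MPS device yourself.

import torch
import torch.distributed as dist

# In llama/generation.py, replace init_process_group("nccl") with:
if not dist.is_initialized():
    dist.init_process_group("gloo")  # CPU backend, built into the macOS wheels

# Pick the compute device explicitly instead of assuming CUDA:
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

Keep in mind the repo assumes CUDA elsewhere too. If I remember it correctly, Llama.build also calls torch.cuda.set_device(local_rank) and sets a CUDA half-precision default tensor type, so you will need to patch those spots to use the device above (or plain CPU float tensors) as well.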

I think my question is: what exactly do I need to change to make that work?