Apple Silicon & torchrun: Distributed package doesn't have NCCL built in

Hi, I’m trying to get torchrun to work on my M1 Pro Mac. I saw the other forum posts on this topic, but development moves quickly and none of them got me to a working setup. So I downloaded Llama 3, installed it with pip install -e ., and tried to run torchrun with the following command:

torchrun --nproc_per_node 1 example_chat_completion.py \
    --no_cuda=True --ckpt_dir Meta-Llama-3-8B-Instruct/ \
    --tokenizer_path Meta-Llama-3-8B-Instruct/tokenizer.model \
    --max_seq_len 512 --max_batch_size 6

Then I get the following error:

Traceback (most recent call last):
  File "/Users/vortec/workspace/llm/llama3/example_chat_completion.py", line 84, in <module>
    fire.Fire(main)
  File "/Users/vortec/workspace/instances/llama3/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/vortec/workspace/instances/llama3/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/Users/vortec/workspace/instances/llama3/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/vortec/workspace/llm/llama3/example_chat_completion.py", line 31, in main
    generator = Llama.build(
                ^^^^^^^^^^^^
  File "/Users/vortec/workspace/llm/llama3/llama/generation.py", line 68, in build
    torch.distributed.init_process_group("nccl")
  File "/Users/vortec/workspace/instances/llama3/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 86, in wrapper
    func_return = func(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/vortec/workspace/instances/llama3/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1184, in init_process_group
    default_pg, _ = _new_process_group_helper(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/vortec/workspace/instances/llama3/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1302, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL built in")
RuntimeError: Distributed package doesn't have NCCL built in

Here’s the output of collect_env:

Collecting environment information...
PyTorch version: 2.2.2
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 14.2.1 (arm64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.1.0.2.5)
CMake version: Could not collect
Libc version: N/A

Python version: 3.11.4 (main, Jun 20 2023, 17:23:00) [Clang 14.0.3 (clang-1403.0.22.14.1)] (64-bit runtime)
Python platform: macOS-14.2.1-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M1 Pro

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.2.2
[pip3] torchaudio==2.2.2
[pip3] torchvision==0.17.2
[conda] Could not collect

What can I do to fix it?

Don’t use NCCL for distributed operations; it isn’t supported on macOS because it requires NVIDIA GPUs and CUDA.
Your system also doesn’t have multiple GPUs (MPS exposes a single device), so you might want to disable distributed usage entirely, or switch the backend as sketched below.
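
If you just want the Llama example to get past this error, the hard-coded "nccl" backend in llama/generation.py (the line shown in your traceback) is the culprit. A minimal sketch of the kind of change, assuming you edit your local checkout; the check uses only stock torch.distributed calls:

import torch
import torch.distributed as dist

# Pick the backend from what the machine actually supports instead of
# hard-coding NCCL, which only exists in CUDA builds of PyTorch.
backend = "nccl" if dist.is_nccl_available() else "gloo"
if not dist.is_initialized():
    dist.init_process_group(backend)

Depending on the repository version, later lines in Llama.build may still assume CUDA, so the general gloo/CPU setup in the answer below is the safer path.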

I think my question is: What do I need to do to change this?

Here’s a concise solution for using PyTorch Distributed (torchrun) on Apple Silicon (M1/M2) where NCCL is unavailable:


Problem

NCCL is not supported on macOS, so distributed training with torchrun fails with errors like:

RuntimeError: Distributed package doesn't have NCCL built in

Solution

Use the gloo backend instead of NCCL. Follow these steps:

1. Install PyTorch with MPS Support

Ensure you have PyTorch ≥ 2.0 (with Apple Silicon MPS support):

pip3 install torch torchvision torchaudio
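
To confirm the installed build is usable on Apple Silicon, a quick check with stock PyTorch calls:

import torch
import torch.distributed as dist

# MPS should be available on an arm64 build of a recent PyTorch,
# and gloo ships with the standard macOS wheels.
print(torch.backends.mps.is_available())  # expect True on Apple Silicon
print(dist.is_gloo_available())           # expect True
print(dist.is_nccl_available())           # False on macOS, hence the original error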

2. Modify Your Training Script

Explicitly set the backend to gloo instead of nccl. Keeping the model on CPU is the most reliable option; device="mps" can work for GPU acceleration, but gradient communication still goes over the CPU (see Key Notes below):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup(rank, world_size):
    # Use the "gloo" backend: it runs over CPU and is the only backend available on macOS
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

def main(rank, world_size):
    setup(rank, world_size)
    model = YourModel()  # keep the model on CPU; gloo communicates CPU tensors
    ddp_model = DDP(model)  # no device_ids: that argument is for CUDA devices only
    # ... training logic ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2  # number of processes; keep it at or below your CPU core count
    torch.multiprocessing.spawn(main, args=(world_size,), nprocs=world_size)

3. Launch with torchrun

torchrun replaces the mp.spawn block from step 2: it sets MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE for you, while the gloo backend is still selected inside the script (see the sketch after the command). Keep --nproc_per_node at or below your CPU core count:

torchrun \
  --nproc_per_node=2 \
  --nnodes=1 \
  --node_rank=0 \
  --master_addr=localhost \
  --master_port=12355 \
  your_script.py
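
When the script is launched by torchrun instead of mp.spawn, the rendezvous information comes from environment variables that torchrun sets, so the setup shrinks to a single call. A minimal sketch (drop the mp.spawn block from step 2 in that case):

import torch.distributed as dist

# torchrun exports RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT, so only the
# backend needs to be specified; the default env:// rendezvous reads the rest.
dist.init_process_group(backend="gloo")
print(f"rank {dist.get_rank()} of {dist.get_world_size()}")
dist.destroy_process_group()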

Key Notes

  • MPS Limitations: While device="mps" uses Apple Silicon GPUs, gloo runs distributed communication on CPU. This hybrid setup may not be optimal but works for basic tasks.
  • CPU Fallback: If MPS causes issues, use device="cpu" for full compatibility (a device-selection sketch follows this list).
  • Performance: Expect slower speeds compared to NCCL on NVIDIA GPUs.
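
A small sketch of the device selection implied by the two notes above, falling back to CPU when MPS is unavailable:

import torch

# Prefer the Apple Silicon GPU when PyTorch exposes it, otherwise fall back to CPU.
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")
x = torch.randn(8, 16, device=device)
print(x.device)  # mps on Apple Silicon builds, cpu otherwise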

Alternatives

For advanced distributed training:

  1. Linux + NVIDIA GPU: Use NCCL on a cloud or remote machine with NVIDIA GPUs.
  2. MLX Framework: Apple’s array framework optimized for Apple Silicon (e.g., mlx-lm for Llama models, MLX-Whisper for speech).