Distributed package doesn't have NCCL built in!

Hi, I have been trying to install the llama2 model locally on my Windows 10 machine from their GitHub repo, but when I run the command:

torchrun --nproc_per_node 1 example_completion.py --ckpt_dir CodeLlama-7b/ --tokenizer_path CodeLlama-7b/tokenizer.model --max_seq_len 128 --max_batch_size 4

I get this error:

NOTE: Redirects are currently not supported in Windows or MacOs.
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-7N7T678]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-7N7T678]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-7N7T678]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-7N7T678]:29500 (system error: 10049 - The requested address is not valid in its context.).
Traceback (most recent call last):
  File "D:\shahzaib\codellama\example_completion.py", line 55, in <module>
    fire.Fire(main)
  File "D:\shahzaib\env\Lib\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\shahzaib\env\Lib\site-packages\fire\core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "D:\shahzaib\env\Lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "D:\shahzaib\codellama\example_completion.py", line 20, in main
    generator = Llama.build(
                ^^^^^^^^^^^^
  File "D:\shahzaib\codellama\llama\generation.py", line 68, in build
    torch.distributed.init_process_group("nccl")
  File "D:\shahzaib\env\Lib\site-packages\torch\distributed\distributed_c10d.py", line 907, in init_process_group
    default_pg = _new_process_group_helper(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\shahzaib\env\Lib\site-packages\torch\distributed\distributed_c10d.py", line 1013, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 7896) of binary: D:\shahzaib\env\Scripts\python.exe
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "D:\shahzaib\env\Scripts\torchrun.exe\__main__.py", line 7, in <module>
  File "D:\shahzaib\env\Lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "D:\shahzaib\env\Lib\site-packages\torch\distributed\run.py", line 794, in main
    run(args)
  File "D:\shahzaib\env\Lib\site-packages\torch\distributed\run.py", line 785, in run
    elastic_launch(
  File "D:\shahzaib\env\Lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\shahzaib\env\Lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example_completion.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-09-16_21:03:45
  host      : DESKTOP-7N7T678
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 7896)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

The docs mention that NCCL is not available on Windows. Maybe try changing it to one of the other backends (e.g. GLOO) and see if that works?

I’m a bit new to this and not familiar with GLOO. Can you tell me how I can use it?

In the error message you’ve supplied, there’s the following line:

torch.distributed.init_process_group("nccl")

This tells PyTorch to do the setup required for distributed training and use the backend called “nccl” (which is usually the recommended one and has more features, but does not seem to be available on Windows). The first thing I would try is changing the string there from “nccl” to “gloo” and seeing if that works. There might be other features in llama2 that require NCCL specifically, but that would be the first modification to try.
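For reference, this is roughly the kind of change I mean in llama/generation.py (around the init_process_group call from your traceback) - I haven’t run this codebase myself, so treat it as a sketch:

import torch.distributed as dist

# Original call in llama/generation.py (per the traceback):
#   torch.distributed.init_process_group("nccl")
# NCCL is not built into the Windows distributed package, so fall back to gloo:
if not dist.is_initialized():
    backend = "nccl" if dist.is_nccl_available() else "gloo"
    dist.init_process_group(backend)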

Alternatively, if the option is available in your case - Windows has WSL2, which provides a Linux environment inside Windows and might support more features, including NCCL.

For both cases, if you try them and run into any other issues - please let us know.

I changed the string from “nccl” to “gloo” in line 68 of generation.py, and now I get this error:

NOTE: Redirects are currently not supported in Windows or MacOs.
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-7N7T678]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-7N7T678]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-7N7T678]:29500 (system error: 10049 - The requested address is not valid in its context.).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [DESKTOP-7N7T678]:29500 (system error: 10049 - The requested address is not valid in its context.).
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
Traceback (most recent call last):
  File "C:\Users\GPU2\codellama\example_completion.py", line 55, in <module>
    fire.Fire(main)
  File "C:\Users\GPU2\AppData\Local\Programs\Python\Python311\Lib\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\GPU2\AppData\Local\Programs\Python\Python311\Lib\site-packages\fire\core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\GPU2\AppData\Local\Programs\Python\Python311\Lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\GPU2\codellama\example_completion.py", line 20, in main
    generator = Llama.build(
                ^^^^^^^^^^^^
  File "C:\Users\GPU2\codellama\llama\generation.py", line 90, in build
    checkpoint = torch.load(ckpt_path, map_location="cpu")
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\GPU2\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\serialization.py", line 815, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\GPU2\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\serialization.py", line 1033, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
EOFError: Ran out of input
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 11372) of binary: C:\Users\GPU2\AppData\Local\Programs\Python\Python311\python.exe
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\GPU2\AppData\Local\Programs\Python\Python311\Scripts\torchrun.exe\__main__.py", line 7, in <module>
  File "C:\Users\GPU2\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "C:\Users\GPU2\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\run.py", line 794, in main
    run(args)
  File "C:\Users\GPU2\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\run.py", line 785, in run
    elastic_launch(
  File "C:\Users\GPU2\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\GPU2\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example_completion.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-09-16_22:03:58
  host      : DESKTOP-7N7T678
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 11372)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

You are running torchrun, a utility meant for distributed workflows. That is not wrong in itself, but the error messages it produces contain a lot of distributed-related noise that you have to filter through to find the actual issue. In this case, the line with the actual error is:

    magic_number = pickle_module.load(f, **pickle_load_args)
EOFError: Ran out of input

which seems like pickle telling you that the checkpoint file is empty, truncated, or missing. I’d add a print of f and pickle_load_args to the code to see what it is trying to load, and then try to call that myself (i.e. from plain Python, not from the llama2 code).
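For example, something like this from a plain Python session (the checkpoint filename below is only my guess at what Llama.build resolves ckpt_path to - point it at whatever file actually sits in your CodeLlama-7b directory):

import os
import torch

ckpt_path = "CodeLlama-7b/consolidated.00.pth"  # hypothetical path, adjust to your download

# A failed or interrupted download often shows up as a 0-byte or truncated file
print(os.path.getsize(ckpt_path), "bytes")

checkpoint = torch.load(ckpt_path, map_location="cpu")
print(type(checkpoint))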

Also, it might be easier to debug if you can run that codebase without torchrun, i.e.:

python example_completion.py --ckpt_dir CodeLlama-7b/ --tokenizer_path CodeLlama-7b/tokenizer.model --max_seq_len 128 --max_batch_size 4

The torchrun --nproc_per_node 1 prefix that I omitted means roughly “use the distributed launcher meant for multiple GPUs, but start only a single process”. However, it depends on the codebase, which I’m not familiar with - if it assumes you always run it through torchrun, you will see errors about the distributed setup again.
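If you do drop torchrun and the code still calls init_process_group, it will expect a few environment variables that torchrun normally sets. A minimal sketch of providing them yourself for a single local process (set them in the shell before running, or early in the script):

import os
import torch.distributed as dist

# torchrun usually exports these; for a single local process these values should do
os.environ.setdefault("RANK", "0")
os.environ.setdefault("WORLD_SIZE", "1")
os.environ.setdefault("LOCAL_RANK", "0")
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group("gloo")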