CPU-only install still defaults to GPU no matter what

Question: is a built-in NCCL backend required for CPU-only torch runs?

Here is the command I used to install:

conda install pytorch torchvision torchaudio cpuonly -c pytorch
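
For reference, a quick way to confirm what that build actually shipped with (plain PyTorch calls, nothing CodeLlama-specific):

import torch
import torch.distributed as dist

# A cpuonly build should report no CUDA and no NCCL, but Gloo is
# compiled into the standard CPU wheels:
print(torch.__version__, torch.cuda.is_available())        # e.g. "2.0.1 False"
print(dist.is_nccl_available(), dist.is_gloo_available())  # expect "False True"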

$ torchrun --nproc_per_node 1 example_completion.py --ckpt_dir CodeLlama-34b/ --tokenizer_path CodeLlama-34b/tokenizer.model --max_seq_len 128 --max_batch_size 4

[W socket.cpp:426] [c10d] The server socket cannot be initialized on [::]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:601] [c10d] The client socket cannot be initialized to connect to [localhost]:29500 (errno: 97 - Address family not supported by protocol).
Traceback (most recent call last):
  File "/home/jpop/llama/codellama/example_completion.py", line 55, in <module>
    fire.Fire(main)
  File "/home/jpop/.conda/envs/codellama/lib/python3.11/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jpop/.conda/envs/codellama/lib/python3.11/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/home/jpop/.conda/envs/codellama/lib/python3.11/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jpop/llama/codellama/example_completion.py", line 20, in main
    generator = Llama.build(
                ^^^^^^^^^^^^
  File "/home/jpop/llama/codellama/llama/generation.py", line 68, in build
    torch.distributed.init_process_group("nccl")
  File "/home/jpop/.conda/envs/codellama/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 907, in init_process_group
    default_pg = _new_process_group_helper(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jpop/.conda/envs/codellama/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 1013, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 99331) of binary: /home/jpop/.conda/envs/codellama/bin/python
Traceback (most recent call last):
  File "/home/jpop/.conda/envs/codellama/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.0.1', 'console_scripts', 'torchrun')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jpop/.conda/envs/codellama/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/jpop/.conda/envs/codellama/lib/python3.11/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/home/jpop/.conda/envs/codellama/lib/python3.11/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/jpop/.conda/envs/codellama/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jpop/.conda/envs/codellama/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example_completion.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------

Where is this model coming from? It should be indicated in the docs. You do not need NCCL, but it is likely being initialised by the code.

It is Facebook’s / Meta’s GitHub repo for CodeLlama; I’m using their 34B model. Hmm, if it is in the code from Facebook, maybe I can figure out how to use my Nvidia card. It is an older gaming GPU, so I’m not holding my breath. I wonder if they have a CPU repository.


I’d try Colab and the 7B model first (What's the machine requirements for each model? · Issue #30 · facebookresearch/codellama · GitHub), and use the GPUs there.

The 34B-parameter model is way too heavy and will take minutes per completion on your CPU, I assume.

Anyhow, here is someone with the same issue: RuntimeError: Distributed package doesn't have NCCL built in · Issue #70 · facebookresearch/codellama · GitHub

And how they fixed it (for the 7B):

As of now, for the 7B parameter model, it’s working on Windows after changing the generation.py file to use torch.distributed.init_process_group("gloo") instead of "nccl".
Is this methodology fine if I want to use a higher-parameter model in the future?
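
Concretely, the one-line change in llama/generation.py (line 68 in the traceback above) would look roughly like this — a sketch of the backend switch, not an official patch:

import torch

# Stock Llama.build hard-codes the CUDA-only NCCL backend:
#     torch.distributed.init_process_group("nccl")
# Choosing the backend from CUDA availability keeps GPU runs working
# while falling back to Gloo on CPU-only installs:
backend = "nccl" if torch.cuda.is_available() else "gloo"
torch.distributed.init_process_group(backend)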

Hmm, well my CPU is one of the first Intel i9 chips. My attempts to use the Nvidia GPU are spitting out this error:

RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
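
From what I can tell, "invalid device ordinal" means the launch asked for a GPU index that doesn’t exist — e.g. --nproc_per_node larger than the number of visible devices. A quick check of what PyTorch actually sees:

import torch

# NCCL/CUDA runs need one visible GPU per local rank, so device_count()
# must be at least torchrun's --nproc_per_node value:
print(torch.cuda.is_available())   # False in the cpuonly env
print(torch.cuda.device_count())   # 1 for a single gaming GPU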

Let me switch environments and try without GPU but using gloo instead of nccl.
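
With the gloo change, the launch command itself should stay the same — torchrun sets RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT in the environment, which init_process_group reads:

$ torchrun --nproc_per_node 1 example_completion.py --ckpt_dir CodeLlama-34b/ --tokenizer_path CodeLlama-34b/tokenizer.model --max_seq_len 128 --max_batch_size 4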

I can’t give advice about GPU stuff; it’s normally not obvious to me what to do.

But again, if you search the issues there is one with some pointers: