Do I need NCCL, Kubernetes, Docker for PyTorch with CUDA?

Hi

I am trying to run the Llama LLM on Windows, using my GPU and CUDA.

I have followed the instructions for installing a PyTorch environment in conda, using all combinations of CUDA 11.8 and 11.7 with Python 3.9, 3.10, and 3.11.

I keep getting the error below when running the example script. It seems that this line:

`torch.distributed.init_process_group("nccl")`

is asking for NCCL - but I don't have that installed, and on conda it's a Linux-only package anyway, and I'm using Windows.
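As a sanity check (this snippet is my own, not part of the llama example), I can list which distributed backends my PyTorch build actually ships with:

```python
import torch.distributed as dist

# Which process-group backends does this PyTorch build provide?
for name, available in [
    ("gloo", dist.is_gloo_available()),
    ("nccl", dist.is_nccl_available()),  # not shipped in the Windows builds
    ("mpi", dist.is_mpi_available()),
]:
    print(f"{name}: {available}")
```

On a Windows build, nccl should come back False, which matches the error below.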

Also, why is the PyTorch package trying to connect to Kubernetes?

It's the torchrun script in the conda environment folder that fails with:
RuntimeError: Distributed package doesn't have NCCL built in

```
python -m torchrun-script --nproc_per_node 1 example_text_completion.py --ckpt_dir …\llama-2-7b --tokenizer_path …\llama-2-7b\tokenizer.model --max_seq_len 128 --max_batch_size 4

NOTE: Redirects are currently not supported in Windows or MacOs.
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.).
(the warning above is printed four times in total)

Traceback (most recent call last):
  File "H:\llama2\repo\llama\example_text_completion.py", line 55, in <module>
    fire.Fire(main)
  File "U:\Miniconda3\envs\llama2env\lib\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "U:\Miniconda3\envs\llama2env\lib\site-packages\fire\core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "U:\Miniconda3\envs\llama2env\lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "H:\llama2\repo\llama\example_text_completion.py", line 18, in main
    generator = Llama.build(
  File "H:\llama2\repo\llama\llama\generation.py", line 61, in build
    torch.distributed.init_process_group("nccl")
  File "U:\Miniconda3\envs\llama2env\lib\site-packages\torch\distributed\distributed_c10d.py", line 907, in init_process_group
    default_pg = _new_process_group_helper(
  File "U:\Miniconda3\envs\llama2env\lib\site-packages\torch\distributed\distributed_c10d.py", line 1013, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 20656) of binary: U:\Miniconda3\envs\llama2env\python.exe
Traceback (most recent call last):
  File "U:\Miniconda3\envs\llama2env\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "U:\Miniconda3\envs\llama2env\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "U:\Miniconda3\envs\llama2env\Scripts\torchrun-script.py", line 33, in <module>
    sys.exit(load_entry_point('torch==2.0.1', 'console_scripts', 'torchrun')())
  File "U:\Miniconda3\envs\llama2env\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "U:\Miniconda3\envs\llama2env\lib\site-packages\torch\distributed\run.py", line 794, in main
    run(args)
  File "U:\Miniconda3\envs\llama2env\lib\site-packages\torch\distributed\run.py", line 785, in run
    elastic_launch(
  File "U:\Miniconda3\envs\llama2env\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "U:\Miniconda3\envs\llama2env\lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

example_text_completion.py FAILED

Failures:
  <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time      : 2023-08-21_01:17:09
  host      : Lightning-III
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 20656)
  error_file: <N/A>
  traceback : To enable traceback see: Error Propagation — PyTorch 2.0 documentation
============================================================
```

Everything else seems to work - i.e. conda, CUDA, Python, torch - it just seems to be torchrun that fails:

```
>>> import torch
>>> torch.cuda.is_available()
True
>>> x = torch.rand(5, 3)
>>> print(x)
tensor([[0.5495, 0.0281, 0.2566],
        [0.7032, 0.1296, 0.8173],
        [0.0329, 0.5500, 0.3025],
        [0.6790, 0.0561, 0.3389],
        [0.4403, 0.5365, 0.5513]])
```

And my conda packages

(llama2env) H:\llama2\repo>conda list
# packages in environment at U:\Miniconda3\envs\llama2env:
#
# Name                    Version                   Build  Channel
blas                      1.0                         mkl
brotlipy                  0.7.0           py39h2bbff1b_1003
ca-certificates           2023.05.30           haa95532_0
certifi                   2023.7.22        py39haa95532_0
cffi                      1.15.1           py39h2bbff1b_3
charset-normalizer        2.0.4              pyhd3eb1b0_0
cryptography              41.0.2           py39hac1b9e3_0
cuda-cccl                 12.2.128                      0    nvidia
cuda-cudart               11.7.99                       0    nvidia
cuda-cudart-dev           11.7.99                       0    nvidia
cuda-cupti                11.7.101                      0    nvidia
cuda-libraries            11.7.1                        0    nvidia
cuda-libraries-dev        11.7.1                        0    nvidia
cuda-nvrtc                11.7.99                       0    nvidia
cuda-nvrtc-dev            11.7.99                       0    nvidia
cuda-nvtx                 11.7.91                       0    nvidia
cuda-runtime              11.7.1                        0    nvidia
fairscale                 0.4.13                   pypi_0    pypi
filelock                  3.9.0            py39haa95532_0
fire                      0.5.0                    pypi_0    pypi
freetype                  2.12.1               ha860e81_0
giflib                    5.2.1                h8cc25b3_3
idna                      3.4              py39haa95532_0
intel-openmp              2023.1.0         h59b6b97_46319
jinja2                    3.1.2            py39haa95532_0
jpeg                      9e                   h2bbff1b_1
lerc                      3.0                  hd77b12b_0
libcublas                 11.10.3.66                    0    nvidia
libcublas-dev             11.10.3.66                    0    nvidia
libcufft                  10.7.2.124                    0    nvidia
libcufft-dev              10.7.2.124                    0    nvidia
libcurand                 10.3.3.129                    0    nvidia
libcurand-dev             10.3.3.129                    0    nvidia
libcusolver               11.4.0.1                      0    nvidia
libcusolver-dev           11.4.0.1                      0    nvidia
libcusparse               11.7.4.91                     0    nvidia
libcusparse-dev           11.7.4.91                     0    nvidia
libdeflate                1.17                 h2bbff1b_0
libnpp                    11.7.4.75                     0    nvidia
libnpp-dev                11.7.4.75                     0    nvidia
libnvjpeg                 11.8.0.2                      0    nvidia
libnvjpeg-dev             11.8.0.2                      0    nvidia
libpng                    1.6.39               h8cc25b3_0
libtiff                   4.5.0                h6c2663c_2
libuv                     1.44.2               h2bbff1b_0
libwebp                   1.2.4                hbc33d0d_1
libwebp-base              1.2.4                h2bbff1b_1
llama                     0.0.1                     dev_0    <develop>
lz4-c                     1.9.4                h2bbff1b_0
markupsafe                2.1.1            py39h2bbff1b_0
mkl                       2023.1.0         h6b88ed4_46357
mkl-service               2.4.0            py39h2bbff1b_1
mkl_fft                   1.3.6            py39hf11a4ad_1
mkl_random                1.2.2            py39hf11a4ad_1
mpmath                    1.3.0            py39haa95532_0
networkx                  3.1              py39haa95532_0
numpy                     1.25.2           py39h055cbcc_0
numpy-base                1.25.2           py39h65a83cf_0
openssl                   3.0.10               h2bbff1b_0
pillow                    9.4.0            py39hd77b12b_0
pip                       23.2.1           py39haa95532_0
pycparser                 2.21               pyhd3eb1b0_0
pyopenssl                 23.2.0           py39haa95532_0
pysocks                   1.7.1            py39haa95532_0
python                    3.9.17               h1aa4202_0
pytorch                   2.0.1           py3.9_cuda11.7_cudnn8_0    pytorch
pytorch-cuda              11.7                 h16d0643_5    pytorch
pytorch-mutex             1.0                        cuda    pytorch
requests                  2.31.0           py39haa95532_0
sentencepiece             0.1.99                   pypi_0    pypi
setuptools                68.0.0           py39haa95532_0
six                       1.16.0                   pypi_0    pypi
sqlite                    3.41.2               h2bbff1b_0
sympy                     1.11.1           py39haa95532_0
tbb                       2021.8.0             h59b6b97_0
termcolor                 2.3.0                    pypi_0    pypi
tk                        8.6.12               h2bbff1b_0
torchaudio                2.0.2                    pypi_0    pypi
torchvision               0.15.2                   pypi_0    pypi
typing_extensions         4.7.1            py39haa95532_0
tzdata                    2023c                h04d1e81_0
urllib3                   1.26.16          py39haa95532_0
vc                        14.2                 h21ff451_1
vs2015_runtime            14.27.29016          h5e58377_2
wheel                     0.38.4           py39haa95532_0
win_inet_pton             1.1.0            py39haa95532_0
xz                        5.4.2                h8cc25b3_0
zlib                      1.2.13               h8cc25b3_0
zstd                      1.5.5                hd43e919_0

And the environment

python -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 2.0.1
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 10 Pro
GCC version: (GCC) 4.4.3
Clang version: 11.1.0
CMake version: Could not collect
Libc version: N/A

Python version: 3.9.17 (main, Jul  5 2023, 20:47:11) [MSC v.1916 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.19045-SP0
Is CUDA available: True
CUDA runtime version: 11.7.64
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 4090
Nvidia driver version: 536.99
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture=9
CurrentClockSpeed=4501
DeviceID=CPU0
Family=107
L2CacheSize=16384
L2CacheSpeed=
Manufacturer=AuthenticAMD
MaxClockSpeed=4501
Name=AMD Ryzen 9 7950X 16-Core Processor
ProcessorType=3
Revision=24834

Versions of relevant libraries:
[pip3] numpy==1.25.2
[pip3] torch==2.0.1
[pip3] torchaudio==2.0.2
[pip3] torchvision==0.15.2
[conda] blas                      1.0                         mkl
[conda] mkl                       2023.1.0         h6b88ed4_46357
[conda] mkl-service               2.4.0            py39h2bbff1b_1
[conda] mkl_fft                   1.3.6            py39hf11a4ad_1
[conda] mkl_random                1.2.2            py39hf11a4ad_1
[conda] numpy                     1.25.2           py39h055cbcc_0
[conda] numpy-base                1.25.2           py39h65a83cf_0
[conda] pytorch                   2.0.1           py3.9_cuda11.7_cudnn8_0    pytorch
[conda] pytorch-cuda              11.7                 h16d0643_5    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torchaudio                2.0.2                    pypi_0    pypi
[conda] torchvision               0.15.2                   pypi_0    pypi

OK, I just found I can change the example script to use the gloo backend instead of NCCL, and this works - but it seems pretty slow. Is gloo much slower than NCCL?
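For anyone else hitting this, the change boils down to initializing the process group with gloo instead of nccl. A minimal single-process sketch (the MASTER_ADDR/MASTER_PORT values here are placeholders that torchrun would normally set for you):

```python
import os
import torch.distributed as dist

# torchrun normally provides these; set them for a single local process.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# The llama example hard-codes "nccl"; gloo is the CPU-capable backend
# that actually ships in the Windows builds of PyTorch.
dist.init_process_group("gloo", rank=0, world_size=1)
print(dist.get_backend())  # gloo
dist.destroy_process_group()
```

A more defensive variant would pick the backend dynamically, e.g. `"nccl" if dist.is_nccl_available() else "gloo"`.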

Good to hear you've figured out that NCCL is not supported on Windows. Also, PyTorch won't use Kubernetes by itself, so your script must be calling into it.

Thanks for the response @ptrblck. Now that I know Kubernetes is not the default behavior and is being called from my script somehow, I can figure out where it's coming from. It looks like a package is being initialized with settings that cause it to try to connect, as there is no explicit code for it in my script - but I will double-check.

Also, in terms of execution speed: the first run of the example script took 50 seconds, which IMO is low performance for my HW, but the second time I ran it, it only took 7 seconds. I presume the first run sets up some cache, or at least initializes things, so that subsequent runs are much faster.
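To confirm where the time actually goes on a cold versus warm run, one could wrap the slow steps in a small timing helper like this (hypothetical helper, not part of the example script):

```python
import time

def timed(label, fn, *args, **kwargs):
    """Run fn, print its wall-clock time, and return its result.

    Handy for comparing the first (cold) run against later (warm) runs.
    """
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f}s")
    return result

# e.g. in the example script, wrap the two slow steps:
#   generator = timed("Llama.build", Llama.build, ckpt_dir, tokenizer_path, ...)
#   results   = timed("generation", generator.text_completion, prompts)
```

If the model-loading step dominates the first run but not the second, that points at filesystem or checkpoint caching rather than anything CUDA-related.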