Do I need NCCL, Kubernetes, Docker for PyTorch with CUDA?

Hi

I am trying to run the Llama LLM on Windows, using my GPU and CUDA.

I have followed the instructions for installing a PyTorch environment in conda, using all combinations of CUDA 11.8 and 11.7 with Python 3.9, 3.10, and 3.11.

I keep getting the error below when running the example script. It seems that this line:

`torch.distributed.init_process_group("nccl")`

is asking for NCCL - but I don't have that installed, and on conda it's a Linux-only package anyway, and I'm using Windows.
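As a sanity check (this snippet is my own, not part of the llama example), I can list which distributed backends my PyTorch build actually ships with:

```python
import torch.distributed as dist

# Which process-group backends does this PyTorch build provide?
for name, available in [
    ("gloo", dist.is_gloo_available()),
    ("nccl", dist.is_nccl_available()),  # not shipped in the Windows builds
    ("mpi", dist.is_mpi_available()),
]:
    print(f"{name}: {available}")
```

On a Windows build, nccl should come back False, which matches the error below.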

Also, why is the PyTorch package trying to connect to Kubernetes?

It's the torchrun script in the conda environment folder that fails with:
RuntimeError: Distributed package doesn't have NCCL built in

```
python -m torchrun-script --nproc_per_node 1 example_text_completion.py --ckpt_dir …\llama-2-7b --tokenizer_path …\llama-2-7b\tokenizer.model --max_seq_len 128 --max_batch_size 4

NOTE: Redirects are currently not supported in Windows or MacOs.
[W C:\cb\pytorch_1000000000000\work\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:29500 (system error: 10049 - The requested address is not valid in its context.).
(the warning above is printed four times in total)

Traceback (most recent call last):
  File "H:\llama2\repo\llama\example_text_completion.py", line 55, in <module>
    fire.Fire(main)
  File "U:\Miniconda3\envs\llama2env\lib\site-packages\fire\core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "U:\Miniconda3\envs\llama2env\lib\site-packages\fire\core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "U:\Miniconda3\envs\llama2env\lib\site-packages\fire\core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "H:\llama2\repo\llama\example_text_completion.py", line 18, in main
    generator = Llama.build(
  File "H:\llama2\repo\llama\llama\generation.py", line 61, in build
    torch.distributed.init_process_group("nccl")
  File "U:\Miniconda3\envs\llama2env\lib\site-packages\torch\distributed\distributed_c10d.py", line 907, in init_process_group
    default_pg = _new_process_group_helper(
  File "U:\Miniconda3\envs\llama2env\lib\site-packages\torch\distributed\distributed_c10d.py", line 1013, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 20656) of binary: U:\Miniconda3\envs\llama2env\python.exe
Traceback (most recent call last):
  File "U:\Miniconda3\envs\llama2env\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "U:\Miniconda3\envs\llama2env\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "U:\Miniconda3\envs\llama2env\Scripts\torchrun-script.py", line 33, in <module>
    sys.exit(load_entry_point('torch==2.0.1', 'console_scripts', 'torchrun')())
  File "U:\Miniconda3\envs\llama2env\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "U:\Miniconda3\envs\llama2env\lib\site-packages\torch\distributed\run.py", line 794, in main
    run(args)
  File "U:\Miniconda3\envs\llama2env\lib\site-packages\torch\distributed\run.py", line 785, in run
    elastic_launch(
  File "U:\Miniconda3\envs\llama2env\lib\site-packages\torch\distributed\launcher\api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "U:\Miniconda3\envs\llama2env\lib\site-packages\torch\distributed\launcher\api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

example_text_completion.py FAILED

Failures:
  <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time      : 2023-08-21_01:17:09
  host      : Lightning-III
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 20656)
  error_file: <N/A>
  traceback : To enable traceback see: Error Propagation — PyTorch 2.0 documentation
============================================================
```

Everything else seems to work - i.e. conda, CUDA, Python, torch - it just seems to be torchrun that fails:

```
>>> import torch
>>> torch.cuda.is_available()
True
>>> x = torch.rand(5, 3)
>>> print(x)
tensor([[0.5495, 0.0281, 0.2566],
        [0.7032, 0.1296, 0.8173],
        [0.0329, 0.5500, 0.3025],
        [0.6790, 0.0561, 0.3389],
        [0.4403, 0.5365, 0.5513]])
```

And my conda packages

(llama2env) H:\llama2\repo>conda list
# packages in environment at U:\Miniconda3\envs\llama2env:
#
# Name                    Version                   Build  Channel
blas                      1.0                         mkl
brotlipy                  0.7.0           py39h2bbff1b_1003
ca-certificates           2023.05.30           haa95532_0
certifi                   2023.7.22        py39haa95532_0
cffi                      1.15.1           py39h2bbff1b_3
charset-normalizer        2.0.4              pyhd3eb1b0_0
cryptography              41.0.2           py39hac1b9e3_0
cuda-cccl                 12.2.128                      0    nvidia
cuda-cudart               11.7.99                       0    nvidia
cuda-cudart-dev           11.7.99                       0    nvidia
cuda-cupti                11.7.101                      0    nvidia
cuda-libraries            11.7.1                        0    nvidia
cuda-libraries-dev        11.7.1                        0    nvidia
cuda-nvrtc                11.7.99                       0    nvidia
cuda-nvrtc-dev            11.7.99                       0    nvidia
cuda-nvtx                 11.7.91                       0    nvidia
cuda-runtime              11.7.1                        0    nvidia
fairscale                 0.4.13                   pypi_0    pypi
filelock                  3.9.0            py39haa95532_0
fire                      0.5.0                    pypi_0    pypi
freetype                  2.12.1               ha860e81_0
giflib                    5.2.1                h8cc25b3_3
idna                      3.4              py39haa95532_0
intel-openmp              2023.1.0         h59b6b97_46319
jinja2                    3.1.2            py39haa95532_0
jpeg                      9e                   h2bbff1b_1
lerc                      3.0                  hd77b12b_0
libcublas                 11.10.3.66                    0    nvidia
libcublas-dev             11.10.3.66                    0    nvidia
libcufft                  10.7.2.124                    0    nvidia
libcufft-dev              10.7.2.124                    0    nvidia
libcurand                 10.3.3.129                    0    nvidia
libcurand-dev             10.3.3.129                    0    nvidia
libcusolver               11.4.0.1                      0    nvidia
libcusolver-dev           11.4.0.1                      0    nvidia
libcusparse               11.7.4.91                     0    nvidia
libcusparse-dev           11.7.4.91                     0    nvidia
libdeflate                1.17                 h2bbff1b_0
libnpp                    11.7.4.75                     0    nvidia
libnpp-dev                11.7.4.75                     0    nvidia
libnvjpeg                 11.8.0.2                      0    nvidia
libnvjpeg-dev             11.8.0.2                      0    nvidia
libpng                    1.6.39               h8cc25b3_0
libtiff                   4.5.0                h6c2663c_2
libuv                     1.44.2               h2bbff1b_0
libwebp                   1.2.4                hbc33d0d_1
libwebp-base              1.2.4                h2bbff1b_1
llama                     0.0.1                     dev_0    <develop>
lz4-c                     1.9.4                h2bbff1b_0
markupsafe                2.1.1            py39h2bbff1b_0
mkl                       2023.1.0         h6b88ed4_46357
mkl-service               2.4.0            py39h2bbff1b_1
mkl_fft                   1.3.6            py39hf11a4ad_1
mkl_random                1.2.2            py39hf11a4ad_1
mpmath                    1.3.0            py39haa95532_0
networkx                  3.1              py39haa95532_0
numpy                     1.25.2           py39h055cbcc_0
numpy-base                1.25.2           py39h65a83cf_0
openssl                   3.0.10               h2bbff1b_0
pillow                    9.4.0            py39hd77b12b_0
pip                       23.2.1           py39haa95532_0
pycparser                 2.21               pyhd3eb1b0_0
pyopenssl                 23.2.0           py39haa95532_0
pysocks                   1.7.1            py39haa95532_0
python                    3.9.17               h1aa4202_0
pytorch                   2.0.1           py3.9_cuda11.7_cudnn8_0    pytorch
pytorch-cuda              11.7                 h16d0643_5    pytorch
pytorch-mutex             1.0                        cuda    pytorch
requests                  2.31.0           py39haa95532_0
sentencepiece             0.1.99                   pypi_0    pypi
setuptools                68.0.0           py39haa95532_0
six                       1.16.0                   pypi_0    pypi
sqlite                    3.41.2               h2bbff1b_0
sympy                     1.11.1           py39haa95532_0
tbb                       2021.8.0             h59b6b97_0
termcolor                 2.3.0                    pypi_0    pypi
tk                        8.6.12               h2bbff1b_0
torchaudio                2.0.2                    pypi_0    pypi
torchvision               0.15.2                   pypi_0    pypi
typing_extensions         4.7.1            py39haa95532_0
tzdata                    2023c                h04d1e81_0
urllib3                   1.26.16          py39haa95532_0
vc                        14.2                 h21ff451_1
vs2015_runtime            14.27.29016          h5e58377_2
wheel                     0.38.4           py39haa95532_0
win_inet_pton             1.1.0            py39haa95532_0
xz                        5.4.2                h8cc25b3_0
zlib                      1.2.13               h8cc25b3_0
zstd                      1.5.5                hd43e919_0

And the environment

python -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 2.0.1
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 10 Pro
GCC version: (GCC) 4.4.3
Clang version: 11.1.0
CMake version: Could not collect
Libc version: N/A

Python version: 3.9.17 (main, Jul  5 2023, 20:47:11) [MSC v.1916 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.19045-SP0
Is CUDA available: True
CUDA runtime version: 11.7.64
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 4090
Nvidia driver version: 536.99
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture=9
CurrentClockSpeed=4501
DeviceID=CPU0
Family=107
L2CacheSize=16384
L2CacheSpeed=
Manufacturer=AuthenticAMD
MaxClockSpeed=4501
Name=AMD Ryzen 9 7950X 16-Core Processor
ProcessorType=3
Revision=24834

Versions of relevant libraries:
[pip3] numpy==1.25.2
[pip3] torch==2.0.1
[pip3] torchaudio==2.0.2
[pip3] torchvision==0.15.2
[conda] blas                      1.0                         mkl
[conda] mkl                       2023.1.0         h6b88ed4_46357
[conda] mkl-service               2.4.0            py39h2bbff1b_1
[conda] mkl_fft                   1.3.6            py39hf11a4ad_1
[conda] mkl_random                1.2.2            py39hf11a4ad_1
[conda] numpy                     1.25.2           py39h055cbcc_0
[conda] numpy-base                1.25.2           py39h65a83cf_0
[conda] pytorch                   2.0.1           py3.9_cuda11.7_cudnn8_0    pytorch
[conda] pytorch-cuda              11.7                 h16d0643_5    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torchaudio                2.0.2                    pypi_0    pypi
[conda] torchvision               0.15.2                   pypi_0    pypi

OK, I just found I can change the example script to use the gloo backend instead of NCCL, and this works - but it seems pretty slow. Is gloo much slower than NCCL?
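For anyone else hitting this, the change boils down to initializing the process group with gloo instead of nccl. A minimal single-process sketch (the MASTER_ADDR/MASTER_PORT values here are placeholders that torchrun would normally set for you):

```python
import os
import torch.distributed as dist

# torchrun normally provides these; set them for a single local process.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# The llama example hard-codes "nccl"; gloo is the CPU-capable backend
# that actually ships in the Windows builds of PyTorch.
dist.init_process_group("gloo", rank=0, world_size=1)
print(dist.get_backend())  # gloo
dist.destroy_process_group()
```

A more defensive variant would pick the backend dynamically, e.g. `"nccl" if dist.is_nccl_available() else "gloo"`.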

Good to hear you've figured out that NCCL is not supported on Windows. Also, PyTorch won't use Kubernetes by itself, so your script must be calling into it.

Thanks for the response @ptrblck. Now that I know Kubernetes is not the default behavior and is being called from my script somehow, I can figure out where it's coming from. It looks like a package is being initialized with settings that cause it to try to connect, as there is no explicit code for it in my script - but I will double-check.

Also, in terms of execution speed: the first run of the example script took 50 seconds, which IMO is low performance for my HW, but the second time I ran it, it only took 7 seconds. I presume the first run sets up some cache, or at least initializes things, so that subsequent runs are much faster.
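To confirm where the time actually goes on a cold versus warm run, one could wrap the slow steps in a small timing helper like this (hypothetical helper, not part of the example script):

```python
import time

def timed(label, fn, *args, **kwargs):
    """Run fn, print its wall-clock time, and return its result.

    Handy for comparing the first (cold) run against later (warm) runs.
    """
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f}s")
    return result

# e.g. in the example script, wrap the two slow steps:
#   generator = timed("Llama.build", Llama.build, ckpt_dir, tokenizer_path, ...)
#   results   = timed("generation", generator.text_completion, prompts)
```

If the model-loading step dominates the first run but not the second, that points at filesystem or checkpoint caching rather than anything CUDA-related.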