Problem
I’m debugging why a cluster of H100s on GKE fails; I’m currently running on a single node. All torch.distributed calls fail, but plain torch.cuda operations work fine. This only happens with the NCCL backend, not Gloo. The failure reproduces with a very simple example and hits barrier and all_reduce, but not torch.cuda.synchronize. Additionally, no C++ stack trace is ever printed, even though the log indicates one is being exported.
Reproducible code
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank):
    dist.init_process_group("nccl", rank=rank, world_size=2)
    torch.cuda.set_device(rank)
    tensor = torch.randn(10).cuda()
    dist.all_reduce(tensor)
    torch.cuda.synchronize(device=rank)

if __name__ == "__main__":
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "23456"
    os.environ["TORCH_CPP_LOG_LEVEL"] = "INFO"
    os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
    mp.spawn(worker, nprocs=2, args=())
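For comparison, the same two-process setup with the gloo backend completes without errors. Here is a minimal sketch of that control (the tensor stays on the CPU since gloo doesn’t need the CUDA device, and the port is arbitrary, just chosen so it can’t collide with a stale NCCL run):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank):
    # gloo runs over TCP, so a CPU tensor is enough to exercise the collectives
    dist.init_process_group("gloo", rank=rank, world_size=2)
    tensor = torch.randn(10)
    dist.all_reduce(tensor)
    dist.barrier()

if __name__ == "__main__":
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "23457"  # different port than the NCCL repro
    mp.spawn(worker, nprocs=2, args=())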
Stack Trace
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I socket.cpp:480] [c10d - debug] The server socket will attempt to listen on an IPv6 address.
[I socket.cpp:531] [c10d - debug] The server socket is attempting to listen on [::]:23456.
[I socket.cpp:605] [c10d] The server socket has started to listen on [::]:23456.
[I TCPStore.cpp:305] [c10d - debug] The server has started on port = 23456.
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (localhost, 23456).
[I socket.cpp:796] [c10d - trace] The client socket is attempting to connect to [localhost]:23456.
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (localhost, 23456).
[I socket.cpp:299] [c10d - debug] The server socket on [::]:23456 has accepted a connection from [localhost]:35336.
[I socket.cpp:884] [c10d] The client socket has connected to [localhost]:23456 on [localhost]:35336.
[I TCPStore.cpp:343] [c10d - debug] TCP client connected to host localhost:23456
[I socket.cpp:796] [c10d - trace] The client socket is attempting to connect to [localhost]:23456.
[I socket.cpp:884] [c10d] The client socket has connected to [localhost]:23456 on [localhost]:35346.
[I TCPStore.cpp:343] [c10d - debug] TCP client connected to host localhost:23456
[I socket.cpp:299] [c10d - debug] The server socket on [::]:23456 has accepted a connection from [localhost]:35346.
[I ProcessGroupNCCL.cpp:804] [PG 0 Rank 1] ProcessGroupNCCL initialization options: NCCL version: 2.20.5, size: 2, global rank: 1, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 1, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 1, TORCH_NCCL_ENABLE_TIMING: 1, TORCH_NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, TORCH_DISTRIBUTED_DEBUG: DETAIL, TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK: 0, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 10000000, ID=92507696
[I ProcessGroupNCCL.cpp:804] [PG 0 Rank 0] ProcessGroupNCCL initialization options: NCCL version: 2.20.5, size: 2, global rank: 0, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 1, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 1, TORCH_NCCL_ENABLE_TIMING: 1, TORCH_NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, TORCH_DISTRIBUTED_DEBUG: DETAIL, TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK: 0, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 10000000, ID=1160576752
[rank0]:[I ProcessGroupWrapper.cpp:587] [Rank 0] Running collective: CollectiveFingerPrint(SequenceNumber=0, OpType=ALLREDUCE, TensorShape=[10], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank1]:[I ProcessGroupWrapper.cpp:587] [Rank 1] Running collective: CollectiveFingerPrint(SequenceNumber=0, OpType=ALLREDUCE, TensorShape=[10], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
run-34dad24e-4ee3-47a7-95e6-5120db024e24-master-0:4882:4882 [0] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET
run-34dad24e-4ee3-47a7-95e6-5120db024e24-master-0:4882:4882 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth
run-34dad24e-4ee3-47a7-95e6-5120db024e24-master-0:4882:4882 [0] NCCL INFO Bootstrap : Using eth0:10.64.7.9<0>
[rank1]:[W Module.cpp:160] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...
W0110 20:17:57.164000 132674338096960 torch/multiprocessing/spawn.py:145] Terminating process 4883 via signal SIGTERM
Traceback (most recent call last):
File "/app/pre-train/unmasked_teacher/single_modality/basic.py", line 20, in <module>
mp.spawn(worker, nprocs=2, args=())
File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 281, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 237, in start_processes
while not context.join():
File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 177, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with exit code 1
nvidia-smi output
Fri Jan 10 20:19:05 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA H100 80GB HBM3 Off | 00000000:04:00.0 Off | 0 |
| N/A 31C P0 69W / 700W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA H100 80GB HBM3 Off | 00000000:05:00.0 Off | 0 |
| N/A 30C P0 68W / 700W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA H100 80GB HBM3 Off | 00000000:0A:00.0 Off | 0 |
| N/A 32C P0 68W / 700W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA H100 80GB HBM3 Off | 00000000:0B:00.0 Off | 0 |
| N/A 29C P0 75W / 700W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA H100 80GB HBM3 Off | 00000000:84:00.0 Off | 0 |
| N/A 30C P0 68W / 700W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA H100 80GB HBM3 Off | 00000000:85:00.0 Off | 0 |
| N/A 28C P0 67W / 700W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA H100 80GB HBM3 Off | 00000000:8A:00.0 Off | 0 |
| N/A 31C P0 69W / 700W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA H100 80GB HBM3 Off | 00000000:8B:00.0 Off | 0 |
| N/A 28C P0 67W / 700W | 1MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
I’m currently only using GPUs 0 and 1 and have CUDA_VISIBLE_DEVICES set to 0,1, so GPUs 2-7 are never touched.
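As a quick sanity check on visibility (a minimal sketch, assuming CUDA_VISIBLE_DEVICES=0,1 is exported before the interpreter starts):

import os
import torch

print(os.environ.get("CUDA_VISIBLE_DEVICES"))  # expected: "0,1"
print(torch.cuda.device_count())               # expected: 2
print(torch.cuda.get_device_name(0))           # expected: NVIDIA H100 80GB HBM3
print(torch.cuda.get_device_name(1))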
Torch Version: 2.3.1
Full Python Deps
anaconda-anon-usage @ file:///croot/anaconda-anon-usage_1710965072196/work
annotated-types==0.7.0
anykeystore==0.2
apex==0.9.10.dev0
archspec @ file:///croot/archspec_1709217642129/work
asttokens @ file:///opt/conda/conda-bld/asttokens_1646925590279/work
astunparse==1.6.3
attrs @ file:///croot/attrs_1695717823297/work
av==10.0.0
beautifulsoup4 @ file:///croot/beautifulsoup4-split_1681493039619/work
boltons @ file:///croot/boltons_1677628692245/work
braceexpand==0.1.7
Brotli @ file:///croot/brotli-split_1714483155106/work
cachetools==5.5.0
certifi @ file:///home/conda/feedstock_root/build_artifacts/certifi_1725278078093/work/certifi
cffi @ file:///croot/cffi_1714483155441/work
chardet @ file:///home/builder/ci_310/chardet_1640804867535/work
charset-normalizer @ file:///tmp/build/80754af9/charset-normalizer_1630003229654/work
click @ file:///croot/click_1698129812380/work
conda @ file:///croot/conda_1689269889729/work
conda-build @ file:///croot/conda-build_1710789183177/work
conda-content-trust @ file:///croot/conda-content-trust_1714483159009/work
conda-libmamba-solver @ file:///croot/conda-libmamba-solver_1691418897561/work/src
conda-package-handling @ file:///croot/conda-package-handling_1714483155348/work
conda_index @ file:///croot/conda-index_1706633791028/work
conda_package_streaming @ file:///croot/conda-package-streaming_1690987966409/work
cryptacular==1.6.2
cryptography @ file:///croot/cryptography_1714660666131/work
decorator @ file:///opt/conda/conda-bld/decorator_1643638310831/work
decord==0.6.0
deepspeed==0.15.2
defusedxml==0.7.1
distro @ file:///croot/distro_1714488253808/work
dnspython==2.6.1
docker-pycreds==0.4.0
einops==0.6.1
exceptiongroup @ file:///croot/exceptiongroup_1706031385326/work
executing @ file:///opt/conda/conda-bld/executing_1646925071911/work
expecttest==0.2.1
filelock @ file:///croot/filelock_1700591183607/work
frozendict @ file:///croot/frozendict_1713194832637/work
fsspec==2024.6.0
ftfy==6.1.1
fvcore==0.1.5.post20220512
gitdb==4.0.11
GitPython==3.1.43
gmpy2 @ file:///tmp/build/80754af9/gmpy2_1645455533097/work
google-api-core==2.21.0
google-auth==2.35.0
google-cloud-core==2.4.1
google-cloud-storage==2.18.2
google-crc32c==1.6.0
google-resumable-media==2.7.2
googleapis-common-protos==1.65.0
greenlet==3.1.1
hjson==3.1.0
huggingface-hub==0.26.1
hupper==1.12.1
hypothesis==6.103.0
idna @ file:///croot/idna_1714398848350/work
imageio==2.36.0
iopath==0.1.10
ipython @ file:///croot/ipython_1704833016303/work
jedi @ file:///tmp/build/80754af9/jedi_1644315229345/work
Jinja2 @ file:///croot/jinja2_1716993405101/work
jsonpatch @ file:///croot/jsonpatch_1714483231291/work
jsonpointer==2.1
jsonschema @ file:///croot/jsonschema_1699041609003/work
jsonschema-specifications @ file:///croot/jsonschema-specifications_1699032386549/work
lark==1.1.9
libarchive-c @ file:///tmp/build/80754af9/python-libarchive-c_1617780486945/work
libmambapy @ file:///croot/mamba-split_1714483352891/work/libmambapy
MarkupSafe @ file:///croot/markupsafe_1704205993651/work
matplotlib-inline @ file:///opt/conda/conda-bld/matplotlib-inline_1662014470464/work
menuinst @ file:///croot/menuinst_1716404372721/work
mkl-fft @ file:///croot/mkl_fft_1695058164594/work
mkl-random @ file:///croot/mkl_random_1695059800811/work
mkl-service==2.4.0
more-itertools @ file:///croot/more-itertools_1700662129964/work
mpi4py @ file:///croot/mpi4py_1671223370575/work
mpmath @ file:///croot/mpmath_1690848262763/work
msgpack==1.1.0
networkx @ file:///croot/networkx_1717597493534/work
ninja==1.11.1.1
numpy==1.21.6
oauthlib==3.2.2
opencv-python==4.10.0.84
optree==0.11.0
packaging @ file:///croot/packaging_1710807400464/work
pandas==1.3.5
parso @ file:///opt/conda/conda-bld/parso_1641458642106/work
PasteDeploy==3.1.0
pbkdf2==1.3
pexpect @ file:///tmp/build/80754af9/pexpect_1605563209008/work
Pillow==9.2.0
pkginfo @ file:///croot/pkginfo_1715695984887/work
plaster==1.1.2
plaster-pastedeploy==1.0.1
platformdirs @ file:///croot/platformdirs_1692205439124/work
pluggy @ file:///tmp/build/80754af9/pluggy_1648024709248/work
portalocker==2.10.1
prompt-toolkit @ file:///croot/prompt-toolkit_1704404351921/work
proto-plus==1.25.0
protobuf==5.28.3
psutil @ file:///opt/conda/conda-bld/psutil_1656431268089/work
ptyprocess @ file:///tmp/build/80754af9/ptyprocess_1609355006118/work/dist/ptyprocess-0.7.0-py2.py3-none-any.whl
pure-eval @ file:///opt/conda/conda-bld/pure_eval_1646925070566/work
py-cpuinfo==9.0.0
pyasn1==0.6.1
pyasn1_modules==0.4.1
pycosat @ file:///croot/pycosat_1714510623388/work
pycparser @ file:///tmp/build/80754af9/pycparser_1636541352034/work
pydantic==2.9.2
pydantic_core==2.23.4
Pygments @ file:///croot/pygments_1684279966437/work
pyOpenSSL @ file:///croot/pyopenssl_1708380408460/work
pyramid==2.0.2
pyramid-mailer==0.15.1
PySocks @ file:///home/builder/ci_310/pysocks_1640793678128/work
python-dateutil==2.9.0.post0
python-etcd==0.4.5
python3-openid==3.2.0
pytz @ file:///croot/pytz_1713974312559/work
PyWavelets==1.4.1
PyYAML @ file:///croot/pyyaml_1698096049011/work
referencing @ file:///croot/referencing_1699012038513/work
regex==2023.5.5
repoze.sendmail==4.4.1
requests @ file:///croot/requests_1716902831423/work
requests-oauthlib==2.0.0
rpds-py @ file:///croot/rpds-py_1698945930462/work
rsa==4.9
ruamel.yaml @ file:///croot/ruamel.yaml_1666304550667/work
ruamel.yaml.clib @ file:///croot/ruamel.yaml.clib_1666302247304/work
safetensors==0.4.5
scikit-image==0.19.3
scipy==1.7.3
sentry-sdk==2.17.0
setproctitle==1.3.3
six @ file:///tmp/build/80754af9/six_1644875935023/work
smmap==5.0.1
sortedcontainers==2.4.0
soupsieve @ file:///croot/soupsieve_1696347547217/work
SQLAlchemy==2.0.36
stack-data @ file:///opt/conda/conda-bld/stack_data_1646927590127/work
sympy==1.12.1
tabulate==0.9.0
tensorboardX==2.6.2.2
termcolor==2.5.0
tifffile==2024.9.20
timm==0.4.12
tokenizers==0.20.1
tomli @ file:///opt/conda/conda-bld/tomli_1657175507142/work
toolz @ file:///croot/toolz_1667464077321/work
torch==2.3.1
torchaudio==2.3.1
torchelastic==0.2.2
torchvision==0.18.1
tqdm @ file:///croot/tqdm_1716395931952/work
traitlets @ file:///croot/traitlets_1671143879854/work
transaction==5.0
transformers==4.46.0
translationstring==1.4
triton==2.3.1
truststore @ file:///croot/truststore_1695244293384/work
types-dataclasses==0.6.6
typing_extensions @ file:///croot/typing_extensions_1715268824938/work
urllib3 @ file:///croot/urllib3_1715635851070/work
velruse==1.1.1
venusian==3.1.0
wandb==0.18.5
wcwidth @ file:///Users/ktietz/demo/mc3/conda-bld/wcwidth_1629357192024/work
webdataset==0.2.100
WebOb==1.8.9
WTForms==3.2.1
wtforms-recaptcha==0.3.2
yacs==0.1.8
zope.deprecation==5.0
zope.interface==7.1.1
zope.sqlalchemy==3.1
zstandard @ file:///croot/zstandard_1714677652653/work
Thoughts
My current assumption is that there’s a conflict between the torch C++ libraries and the driver version on the GPUs. I can’t change the driver, however, since it’s locked behind GKE. The .so files are definitely being loaded, so LD_LIBRARY_PATH is set correctly. It looks like I’m hitting a problem deeper inside those .so files that never produces a stack trace. Interestingly, it also fails when I run on a single GPU.
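One thing I’ve been using to try to separate "NCCL vs. driver" from "c10d / rendezvous" is the sketch below. It’s rough, and the torch.cuda.nccl bindings are internal, so I’m assuming they exercise the same libnccl path as the process group:

import torch
import torch.cuda.nccl as nccl
import torch.distributed as dist

# Library-level checks: these only need the bundled libnccl and the driver,
# not the TCPStore rendezvous.
print(torch.__version__, torch.version.cuda)
print("nccl available:", dist.is_nccl_available())
print("bundled nccl version:", nccl.version())

# In-process all_reduce across the two visible GPUs via the raw NCCL
# bindings (no init_process_group involved). If this also hangs or fails,
# the problem is below torch.distributed.
tensors = [torch.ones(10, device=f"cuda:{i}") for i in range(2)]
nccl.all_reduce(tensors)  # in-place when no outputs are given
for i in range(2):
    torch.cuda.synchronize(i)
print(tensors[0][:3], tensors[1][:3])  # expect all 2.0 if the reduce ran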
I have another cluster setup where this works properly, but there’s no difference in my Terraform between the two. I originally thought it was a networking problem, but all of the torch processes are able to open a socket connection to MASTER_ADDR:MASTER_PORT.
Any help would be appreciated; I don’t know whether I’ve hit a deep bug or not.