Torch All Reduce Fails with No CPP Stack Trace

Problem

I’m debugging why a cluster of H100s on GKE fails; right now I’m running on a single node. All torch.distributed calls fail, but torch.cuda works fine. This only happens with the nccl backend, not gloo. The repro below is a very simple example: it fails on barrier and all_reduce, but torch.cuda.synchronize succeeds. Additionally, there are no C++ stack traces even when the log indicates one is being exported.

Reproducible code

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank):
    # Rendezvous over TCP (MASTER_ADDR/MASTER_PORT set below) with the NCCL backend.
    dist.init_process_group("nccl", rank=rank, world_size=2)
    torch.cuda.set_device(rank)
    tensor = torch.randn(10).cuda()
    dist.all_reduce(tensor)  # fails here (barrier fails the same way)
    torch.cuda.synchronize(device=rank)  # a plain CUDA sync on its own works fine


if __name__ == "__main__":
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "23456"
    os.environ["TORCH_CPP_LOG_LEVEL"] = "INFO"
    os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
    mp.spawn(worker, nprocs=2, args=())
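
For comparison, the same collective over the gloo backend completes. A minimal sketch of that variant (worker_gloo is just an illustrative name, and I use a CPU tensor here to isolate the backend):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker_gloo(rank):
    # Same rendezvous, gloo backend, CPU tensor; this path works.
    dist.init_process_group("gloo", rank=rank, world_size=2)
    tensor = torch.randn(10)
    dist.all_reduce(tensor)
    dist.destroy_process_group()


if __name__ == "__main__":
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "23456"
    mp.spawn(worker_gloo, nprocs=2, args=())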

Stack Trace

[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I socket.cpp:480] [c10d - debug] The server socket will attempt to listen on an IPv6 address.
[I socket.cpp:531] [c10d - debug] The server socket is attempting to listen on [::]:23456.
[I socket.cpp:605] [c10d] The server socket has started to listen on [::]:23456.
[I TCPStore.cpp:305] [c10d - debug] The server has started on port = 23456.
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (localhost, 23456).
[I socket.cpp:796] [c10d - trace] The client socket is attempting to connect to [localhost]:23456.
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (localhost, 23456).
[I socket.cpp:299] [c10d - debug] The server socket on [::]:23456 has accepted a connection from [localhost]:35336.
[I socket.cpp:884] [c10d] The client socket has connected to [localhost]:23456 on [localhost]:35336.
[I TCPStore.cpp:343] [c10d - debug] TCP client connected to host localhost:23456
[I socket.cpp:796] [c10d - trace] The client socket is attempting to connect to [localhost]:23456.
[I socket.cpp:884] [c10d] The client socket has connected to [localhost]:23456 on [localhost]:35346.
[I TCPStore.cpp:343] [c10d - debug] TCP client connected to host localhost:23456
[I socket.cpp:299] [c10d - debug] The server socket on [::]:23456 has accepted a connection from [localhost]:35346.
[I ProcessGroupNCCL.cpp:804] [PG 0 Rank 1] ProcessGroupNCCL initialization options: NCCL version: 2.20.5, size: 2, global rank: 1, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 1, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 1, TORCH_NCCL_ENABLE_TIMING: 1, TORCH_NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, TORCH_DISTRIBUTED_DEBUG: DETAIL, TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK: 0, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 10000000, ID=92507696
[I ProcessGroupNCCL.cpp:804] [PG 0 Rank 0] ProcessGroupNCCL initialization options: NCCL version: 2.20.5, size: 2, global rank: 0, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 1, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 1, TORCH_NCCL_ENABLE_TIMING: 1, TORCH_NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, TORCH_DISTRIBUTED_DEBUG: DETAIL, TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK: 0, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 10000000, ID=1160576752
[rank0]:[I ProcessGroupWrapper.cpp:587] [Rank 0] Running collective: CollectiveFingerPrint(SequenceNumber=0, OpType=ALLREDUCE, TensorShape=[10], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank1]:[I ProcessGroupWrapper.cpp:587] [Rank 1] Running collective: CollectiveFingerPrint(SequenceNumber=0, OpType=ALLREDUCE, TensorShape=[10], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
run-34dad24e-4ee3-47a7-95e6-5120db024e24-master-0:4882:4882 [0] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET
run-34dad24e-4ee3-47a7-95e6-5120db024e24-master-0:4882:4882 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth
run-34dad24e-4ee3-47a7-95e6-5120db024e24-master-0:4882:4882 [0] NCCL INFO Bootstrap : Using eth0:10.64.7.9<0>
[rank1]:[W Module.cpp:160] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...

W0110 20:17:57.164000 132674338096960 torch/multiprocessing/spawn.py:145] Terminating process 4883 via signal SIGTERM

Traceback (most recent call last):
  File "/app/pre-train/unmasked_teacher/single_modality/basic.py", line 20, in <module>
    mp.spawn(worker, nprocs=2, args=())
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 281, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 237, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 177, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with exit code 1

Nvidia outputs

Fri Jan 10 20:19:05 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          Off |   00000000:04:00.0 Off |                    0 |
| N/A   31C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA H100 80GB HBM3          Off |   00000000:05:00.0 Off |                    0 |
| N/A   30C    P0             68W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA H100 80GB HBM3          Off |   00000000:0A:00.0 Off |                    0 |
| N/A   32C    P0             68W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA H100 80GB HBM3          Off |   00000000:0B:00.0 Off |                    0 |
| N/A   29C    P0             75W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA H100 80GB HBM3          Off |   00000000:84:00.0 Off |                    0 |
| N/A   30C    P0             68W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA H100 80GB HBM3          Off |   00000000:85:00.0 Off |                    0 |
| N/A   28C    P0             67W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA H100 80GB HBM3          Off |   00000000:8A:00.0 Off |                    0 |
| N/A   31C    P0             69W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA H100 80GB HBM3          Off |   00000000:8B:00.0 Off |                    0 |
| N/A   28C    P0             67W /  700W |       1MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

I’m currently only using GPUs 0 and 1 and have CUDA_VISIBLE_DEVICES set to 0,1, so GPUs 2–7 never get touched.
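
A quick sanity check that only those two devices are visible to torch (a sketch, not output captured from the node):

import torch

# With CUDA_VISIBLE_DEVICES=0,1 this should report 2 devices, both H100s.
print(torch.cuda.device_count())
print([torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())])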

Torch Version: 2.3.1
Full Python Deps

anaconda-anon-usage @ file:///croot/anaconda-anon-usage_1710965072196/work
annotated-types==0.7.0
anykeystore==0.2
apex==0.9.10.dev0
archspec @ file:///croot/archspec_1709217642129/work
asttokens @ file:///opt/conda/conda-bld/asttokens_1646925590279/work
astunparse==1.6.3
attrs @ file:///croot/attrs_1695717823297/work
av==10.0.0
beautifulsoup4 @ file:///croot/beautifulsoup4-split_1681493039619/work
boltons @ file:///croot/boltons_1677628692245/work
braceexpand==0.1.7
Brotli @ file:///croot/brotli-split_1714483155106/work
cachetools==5.5.0
certifi @ file:///home/conda/feedstock_root/build_artifacts/certifi_1725278078093/work/certifi
cffi @ file:///croot/cffi_1714483155441/work
chardet @ file:///home/builder/ci_310/chardet_1640804867535/work
charset-normalizer @ file:///tmp/build/80754af9/charset-normalizer_1630003229654/work
click @ file:///croot/click_1698129812380/work
conda @ file:///croot/conda_1689269889729/work
conda-build @ file:///croot/conda-build_1710789183177/work
conda-content-trust @ file:///croot/conda-content-trust_1714483159009/work
conda-libmamba-solver @ file:///croot/conda-libmamba-solver_1691418897561/work/src
conda-package-handling @ file:///croot/conda-package-handling_1714483155348/work
conda_index @ file:///croot/conda-index_1706633791028/work
conda_package_streaming @ file:///croot/conda-package-streaming_1690987966409/work
cryptacular==1.6.2
cryptography @ file:///croot/cryptography_1714660666131/work
decorator @ file:///opt/conda/conda-bld/decorator_1643638310831/work
decord==0.6.0
deepspeed==0.15.2
defusedxml==0.7.1
distro @ file:///croot/distro_1714488253808/work
dnspython==2.6.1
docker-pycreds==0.4.0
einops==0.6.1
exceptiongroup @ file:///croot/exceptiongroup_1706031385326/work
executing @ file:///opt/conda/conda-bld/executing_1646925071911/work
expecttest==0.2.1
filelock @ file:///croot/filelock_1700591183607/work
frozendict @ file:///croot/frozendict_1713194832637/work
fsspec==2024.6.0
ftfy==6.1.1
fvcore==0.1.5.post20220512
gitdb==4.0.11
GitPython==3.1.43
gmpy2 @ file:///tmp/build/80754af9/gmpy2_1645455533097/work
google-api-core==2.21.0
google-auth==2.35.0
google-cloud-core==2.4.1
google-cloud-storage==2.18.2
google-crc32c==1.6.0
google-resumable-media==2.7.2
googleapis-common-protos==1.65.0
greenlet==3.1.1
hjson==3.1.0
huggingface-hub==0.26.1
hupper==1.12.1
hypothesis==6.103.0
idna @ file:///croot/idna_1714398848350/work
imageio==2.36.0
iopath==0.1.10
ipython @ file:///croot/ipython_1704833016303/work
jedi @ file:///tmp/build/80754af9/jedi_1644315229345/work
Jinja2 @ file:///croot/jinja2_1716993405101/work
jsonpatch @ file:///croot/jsonpatch_1714483231291/work
jsonpointer==2.1
jsonschema @ file:///croot/jsonschema_1699041609003/work
jsonschema-specifications @ file:///croot/jsonschema-specifications_1699032386549/work
lark==1.1.9
libarchive-c @ file:///tmp/build/80754af9/python-libarchive-c_1617780486945/work
libmambapy @ file:///croot/mamba-split_1714483352891/work/libmambapy
MarkupSafe @ file:///croot/markupsafe_1704205993651/work
matplotlib-inline @ file:///opt/conda/conda-bld/matplotlib-inline_1662014470464/work
menuinst @ file:///croot/menuinst_1716404372721/work
mkl-fft @ file:///croot/mkl_fft_1695058164594/work
mkl-random @ file:///croot/mkl_random_1695059800811/work
mkl-service==2.4.0
more-itertools @ file:///croot/more-itertools_1700662129964/work
mpi4py @ file:///croot/mpi4py_1671223370575/work
mpmath @ file:///croot/mpmath_1690848262763/work
msgpack==1.1.0
networkx @ file:///croot/networkx_1717597493534/work
ninja==1.11.1.1
numpy==1.21.6
oauthlib==3.2.2
opencv-python==4.10.0.84
optree==0.11.0
packaging @ file:///croot/packaging_1710807400464/work
pandas==1.3.5
parso @ file:///opt/conda/conda-bld/parso_1641458642106/work
PasteDeploy==3.1.0
pbkdf2==1.3
pexpect @ file:///tmp/build/80754af9/pexpect_1605563209008/work
Pillow==9.2.0
pkginfo @ file:///croot/pkginfo_1715695984887/work
plaster==1.1.2
plaster-pastedeploy==1.0.1
platformdirs @ file:///croot/platformdirs_1692205439124/work
pluggy @ file:///tmp/build/80754af9/pluggy_1648024709248/work
portalocker==2.10.1
prompt-toolkit @ file:///croot/prompt-toolkit_1704404351921/work
proto-plus==1.25.0
protobuf==5.28.3
psutil @ file:///opt/conda/conda-bld/psutil_1656431268089/work
ptyprocess @ file:///tmp/build/80754af9/ptyprocess_1609355006118/work/dist/ptyprocess-0.7.0-py2.py3-none-any.whl
pure-eval @ file:///opt/conda/conda-bld/pure_eval_1646925070566/work
py-cpuinfo==9.0.0
pyasn1==0.6.1
pyasn1_modules==0.4.1
pycosat @ file:///croot/pycosat_1714510623388/work
pycparser @ file:///tmp/build/80754af9/pycparser_1636541352034/work
pydantic==2.9.2
pydantic_core==2.23.4
Pygments @ file:///croot/pygments_1684279966437/work
pyOpenSSL @ file:///croot/pyopenssl_1708380408460/work
pyramid==2.0.2
pyramid-mailer==0.15.1
PySocks @ file:///home/builder/ci_310/pysocks_1640793678128/work
python-dateutil==2.9.0.post0
python-etcd==0.4.5
python3-openid==3.2.0
pytz @ file:///croot/pytz_1713974312559/work
PyWavelets==1.4.1
PyYAML @ file:///croot/pyyaml_1698096049011/work
referencing @ file:///croot/referencing_1699012038513/work
regex==2023.5.5
repoze.sendmail==4.4.1
requests @ file:///croot/requests_1716902831423/work
requests-oauthlib==2.0.0
rpds-py @ file:///croot/rpds-py_1698945930462/work
rsa==4.9
ruamel.yaml @ file:///croot/ruamel.yaml_1666304550667/work
ruamel.yaml.clib @ file:///croot/ruamel.yaml.clib_1666302247304/work
safetensors==0.4.5
scikit-image==0.19.3
scipy==1.7.3
sentry-sdk==2.17.0
setproctitle==1.3.3
six @ file:///tmp/build/80754af9/six_1644875935023/work
smmap==5.0.1
sortedcontainers==2.4.0
soupsieve @ file:///croot/soupsieve_1696347547217/work
SQLAlchemy==2.0.36
stack-data @ file:///opt/conda/conda-bld/stack_data_1646927590127/work
sympy==1.12.1
tabulate==0.9.0
tensorboardX==2.6.2.2
termcolor==2.5.0
tifffile==2024.9.20
timm==0.4.12
tokenizers==0.20.1
tomli @ file:///opt/conda/conda-bld/tomli_1657175507142/work
toolz @ file:///croot/toolz_1667464077321/work
torch==2.3.1
torchaudio==2.3.1
torchelastic==0.2.2
torchvision==0.18.1
tqdm @ file:///croot/tqdm_1716395931952/work
traitlets @ file:///croot/traitlets_1671143879854/work
transaction==5.0
transformers==4.46.0
translationstring==1.4
triton==2.3.1
truststore @ file:///croot/truststore_1695244293384/work
types-dataclasses==0.6.6
typing_extensions @ file:///croot/typing_extensions_1715268824938/work
urllib3 @ file:///croot/urllib3_1715635851070/work
velruse==1.1.1
venusian==3.1.0
wandb==0.18.5
wcwidth @ file:///Users/ktietz/demo/mc3/conda-bld/wcwidth_1629357192024/work
webdataset==0.2.100
WebOb==1.8.9
WTForms==3.2.1
wtforms-recaptcha==0.3.2
yacs==0.1.8
zope.deprecation==5.0
zope.interface==7.1.1
zope.sqlalchemy==3.1
zstandard @ file:///croot/zstandard_1714677652653/work

Thoughts

My current assumption is that there’s a conflict between the torch C++ libraries and the driver versions on the GPUs. I can’t change the drivers, however, since they’re locked behind GKE. The .so files are definitely being loaded, so LD_LIBRARY_PATH is set correctly; it looks like I’m hitting a problem deeper inside those .so files that doesn’t surface a stack trace. Interestingly, it also fails when I run on a single GPU.
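
For reference, a quick way to confirm which CUDA and NCCL builds this torch install actually links against (a sanity-check sketch, not output from the cluster):

import torch

print(torch.__version__)          # 2.3.1
print(torch.version.cuda)         # CUDA toolkit torch was built against
print(torch.cuda.nccl.version())  # bundled NCCL, should match the 2.20.5 in the logs
print(torch.cuda.is_available(), torch.cuda.device_count())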

I have another cluster setup where this works properly, but there’s no difference in the Terraform (tf) config for either. I originally thought it was a networking problem, but all the torch processes are able to open a socket connection to MASTER_ADDR:MASTER_PORT.

Any help would be appreciated; I don’t know whether I’ve hit a deep bug or not.

Why would this be the case? Did you see any driver-related issues?

Could you also check the output logs from NCCL_DEBUG=INFO to see if NCCL raises any errors?

The stack trace above is unfortunately already with NCCL_DEBUG=INFO; I essentially get no logs from either torch or NCCL. Sorry, I mixed things up earlier; I’ve added the correct trace below. The driver theory just comes from past problems: I have almost nothing to go off given the lack of logs, and drivers have caused pain before. What makes me skeptical is that I see no change in GPU utilization or GPU memory allocation, so it appears the GPUs aren’t even receiving data.
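
For reference, this is roughly how the NCCL logging gets enabled for these runs, set in the parent process before mp.spawn (a sketch; the commented NCCL_DEBUG_SUBSYS line is an extra knob I haven’t used):

import os

os.environ["NCCL_DEBUG"] = "INFO"
# os.environ["NCCL_DEBUG_SUBSYS"] = "ALL"  # optional: much more verbose per-subsystem logging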

I also suspected the failure could be happening because the torch init wasn’t reaching the GPUs, or because of a problem with the networking I set up on each node, but gloo passes so I ruled that out. Disabling P2P didn’t remedy the problem either. I now think it might be a torch problem, since gloo and torch.cuda.synchronize both work fine and the NCCL process group initializes correctly.
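
Since disabling P2P didn’t change anything, this is the kind of peer-access sanity check I have in mind (a sketch; I haven’t pasted its output here):

import torch

# Whether CUDA peer-to-peer access is possible between the two visible devices.
print(torch.cuda.can_device_access_peer(0, 1))
print(torch.cuda.can_device_access_peer(1, 0))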

[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I debug.cpp:49] [c10d] The debug level is set to DETAIL.
[I socket.cpp:480] [c10d - debug] The server socket will attempt to listen on an IPv6 address.
[I socket.cpp:531] [c10d - debug] The server socket is attempting to listen on [::]:23456.
[I socket.cpp:605] [c10d] The server socket has started to listen on [::]:23456.
[I TCPStore.cpp:305] [c10d - debug] The server has started on port = 23456.
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (localhost, 23456).
[I socket.cpp:796] [c10d - trace] The client socket is attempting to connect to [localhost]:23456.
[I socket.cpp:720] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (localhost, 23456).
[I socket.cpp:884] [c10d] The client socket has connected to [localhost]:23456 on [localhost]:60378.
[I TCPStore.cpp:343] [c10d - debug] TCP client connected to host localhost:23456
[I socket.cpp:299] [c10d - debug] The server socket on [::]:23456 has accepted a connection from [localhost]:60378.
[I socket.cpp:796] [c10d - trace] The client socket is attempting to connect to [localhost]:23456.
[I socket.cpp:299] [c10d - debug] The server socket on [::]:23456 has accepted a connection from [localhost]:60382.
[I socket.cpp:884] [c10d] The client socket has connected to [localhost]:23456 on [localhost]:60382.
[I TCPStore.cpp:343] [c10d - debug] TCP client connected to host localhost:23456
[I ProcessGroupNCCL.cpp:804] [PG 0 Rank 1] ProcessGroupNCCL initialization options: NCCL version: 2.20.5, size: 2, global rank: 1, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 1, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 1, TORCH_NCCL_ENABLE_TIMING: 1, TORCH_NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, TORCH_DISTRIBUTED_DEBUG: DETAIL, TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK: 0, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 10000000, ID=655132336
[I ProcessGroupNCCL.cpp:804] [PG 0 Rank 0] ProcessGroupNCCL initialization options: NCCL version: 2.20.5, size: 2, global rank: 0, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 1, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 1, TORCH_NCCL_ENABLE_TIMING: 1, TORCH_NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 600000, USE_HIGH_PRIORITY_STREAM: 0, SPLIT_FROM: 0, SPLIT_COLOR: 0, TORCH_DISTRIBUTED_DEBUG: DETAIL, TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK: 0, TORCH_NCCL_ENABLE_MONITORING: 1, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 600, TORCH_NCCL_TRACE_BUFFER_SIZE: 0, TORCH_NCCL_COORD_CHECK_MILSEC: 10000000, ID=104092912
[rank0]:[I ProcessGroupWrapper.cpp:587] [Rank 0] Running collective: CollectiveFingerPrint(SequenceNumber=0, OpType=ALLREDUCE, TensorShape=[10], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank1]:[I ProcessGroupWrapper.cpp:587] [Rank 1] Running collective: CollectiveFingerPrint(SequenceNumber=0, OpType=ALLREDUCE, TensorShape=[10], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
run-34dad24e-4ee3-47a7-95e6-5120db024e24-master-0:5163:5163 [0] NCCL INFO NCCL_SOCKET_FAMILY set by environment to AF_INET
run-34dad24e-4ee3-47a7-95e6-5120db024e24-master-0:5163:5163 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth
run-34dad24e-4ee3-47a7-95e6-5120db024e24-master-0:5163:5163 [0] NCCL INFO Bootstrap : Using eth0:10.64.7.9<0>
[rank1]:[W Module.cpp:160] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...

W0110 20:54:38.090000 135169662330688 torch/multiprocessing/spawn.py:145] Terminating process 5164 via signal SIGTERM
Traceback (most recent call last):
  File "/app/pre-train/unmasked_teacher/single_modality/basic.py", line 20, in <module>
    mp.spawn(worker, nprocs=2, args=())
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 281, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 237, in start_processes
    while not context.join():
  File "/opt/conda/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 177, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with exit code 1

I used launch blocking (CUDA_LAUNCH_BLOCKING=1) and got:

[rank0]:[I ProcessGroupWrapper.cpp:587] [Rank 0] Running collective: CollectiveFingerPrint(SequenceNumber=1, OpType=ALLREDUCE, TensorShape=[10], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank1]:[I ProcessGroupWrapper.cpp:587] [Rank 1] Running collective: CollectiveFingerPrint(SequenceNumber=1, OpType=ALLREDUCE, TensorShape=[10], TensorDtypes=Float, TensorDeviceTypes=TensorOptions(dtype=float (default), device=cuda, layout=Strided (default), requires_grad=false (default), pinned_memory=false (default), memory_format=(nullopt)))
[rank0]:[I ProcessGroupNCCL.cpp:1109] [PG 0 Rank 0] ProcessGroupNCCL destructor entered.
[rank0]:[I ProcessGroupNCCL.cpp:1094] [PG 0 Rank 0] Launching ProcessGroupNCCL abort asynchrounously.
[rank0]:[I ProcessGroupNCCL.cpp:1041] [PG 0 Rank 0] ProcessGroupNCCL destroying ncclComm_ 0x28cd5920 on CUDA device: 0
[rank0]:[I NCCLUtils.hpp:371] Aborting ncclComm_ 0x28cd5920 with reason: No abort reason provided.
[rank1]:[I ProcessGroupNCCL.cpp:1109] [PG 0 Rank 1] ProcessGroupNCCL destructor entered.
[rank1]:[I ProcessGroupNCCL.cpp:1094] [PG 0 Rank 1] Launching ProcessGroupNCCL abort asynchrounously.
run-34857d21-b64a-44f9-800f-a9e57f2ab1c5-q6pv7:2651:2797 [0] NCCL INFO [Service thread] Connection closed by localRank 0
[rank1]:[I ProcessGroupNCCL.cpp:1041] [PG 0 Rank 1] ProcessGroupNCCL destroying ncclComm_ 0x138532e0 on CUDA device: 1
[rank1]:[I NCCLUtils.hpp:371] Aborting ncclComm_ 0x138532e0 with reason: No abort reason provided.
run-34857d21-b64a-44f9-800f-a9e57f2ab1c5-q6pv7:2652:2798 [1] NCCL INFO [Service thread] Connection closed by localRank 1
run-34857d21-b64a-44f9-800f-a9e57f2ab1c5-q6pv7:2651:2803 [0] NCCL INFO comm 0x28cd5920 rank 0 nranks 2 cudaDev 0 busId 50 - Abort COMPLETE
[rank0]:[I ProcessGroupNCCL.cpp:1060] [PG 0 Rank 0] ProcessGroupNCCL destroyed  communicator on CUDA device: 0 with stream: 3
[rank0]:[I ProcessGroupNCCL.cpp:999] [PG 0 Rank 0] future is successfully executed for: ProcessGroup abort
[rank0]:[I ProcessGroupNCCL.cpp:1100] [PG 0 Rank 0] ProcessGroupNCCL aborts successfully.
[rank0]:[I ProcessGroupNCCL.cpp:1132] [PG 0 Rank 0] ProcessGroupNCCL watchdog thread joined.
[rank0]:[I ProcessGroupNCCL.cpp:1136] [PG 0 Rank 0] ProcessGroupNCCL heart beat monitor thread joined.
[rank0]:[W Module.cpp:160] symbolizing C++ stack trace for exception; if this hangs, rerun with TORCH_DISABLE_ADDR2LINE=1...

run-34857d21-b64a-44f9-800f-a9e57f2ab1c5-q6pv7:2652:2804 [0] NCCL INFO comm 0x138532e0 rank 1 nranks 2 cudaDev 1 busId 60 - Abort COMPLETE
[rank1]:[I ProcessGroupNCCL.cpp:1060] [PG 0 Rank 1] ProcessGroupNCCL destroyed  communicator on CUDA device: 1 with stream: 3
[rank1]:[I ProcessGroupNCCL.cpp:999] [PG 0 Rank 1] future is successfully executed for: ProcessGroup abort
[rank1]:[I ProcessGroupNCCL.cpp:1100] [PG 0 Rank 1] ProcessGroupNCCL aborts successfully.
[rank1]:[I ProcessGroupNCCL.cpp:1132] [PG 0 Rank 1] ProcessGroupNCCL watchdog thread joined.
[rank1]:[I ProcessGroupNCCL.cpp:1136] [PG 0 Rank 1] ProcessGroupNCCL heart beat monitor thread joined.

Not much to go on lol
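
One thing left to try, since the warning in the trace suggests it: rerunning with addr2line symbolization disabled, in case that step is what swallows the C++ trace (a sketch of the env tweak, applied to the same repro as above):

import os

# Suggested by the "symbolizing C++ stack trace for exception" warning in the logs.
os.environ["TORCH_DISABLE_ADDR2LINE"] = "1"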