I am running a 2 GPU (same node) training run. The script works:
- On two A40s
- On one H100
But it fails, interestingly, on 2 H100s, and I am not sure of the source of the hardware dependence. The script successfully runs a pretraining evaluation, a bunch of training steps, and a second evaluation, and then hangs collecting gradients (the DDP all-reduce) when training resumes. I am using the accelerate library. The trace, along with my scripts and some configs, is below.
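For reference, the NCCL INFO lines in the trace come from running with NCCL debugging enabled, roughly like this (a sketch; the same variables can instead be exported in the shell before `accelerate launch`, they just need to be set before the process group is created):

```python
# Sketch of debug settings for reproducing this (not my exact launch script).
# They need to be in the environment before torch creates the NCCL process group.
import os

os.environ.setdefault("NCCL_DEBUG", "INFO")                 # produces the NCCL INFO lines below
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")  # extra checks for mismatched collectives
```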
accelerate launch pipeline/2.1_self_supervised_training.py
/kfs2/projects/metalsitenn/metal_site_modeling/equiformer/nets/layer_norm.py:89: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
@torch.cuda.amp.autocast(enabled=False)
/kfs2/projects/metalsitenn/metal_site_modeling/equiformer/nets/layer_norm.py:89: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
@torch.cuda.amp.autocast(enabled=False)
x3100c0s5b0n0:3347432:3347432 [0] NCCL INFO Bootstrap : Using hsn0:10.150.3.12<0>
x3100c0s5b0n0:3347432:3347432 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
x3100c0s5b0n0:3347433:3347433 [1] NCCL INFO cudaDriverVersion 12040
x3100c0s5b0n0:3347433:3347433 [1] NCCL INFO Bootstrap : Using hsn0:10.150.3.12<0>
x3100c0s5b0n0:3347433:3347433 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
x3100c0s5b0n0:3347432:3347432 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.20.5+cuda12.4
x3100c0s5b0n0:3347433:3349752 [1] NCCL INFO NET/IB : No device found.
x3100c0s5b0n0:3347432:3349753 [0] NCCL INFO NET/IB : No device found.
x3100c0s5b0n0:3347433:3349752 [1] NCCL INFO NET/Socket : Using [0]hsn0:10.150.3.12<0> [1]hsn1:10.150.1.122<0> [2]bond0:172.23.1.3<0>
x3100c0s5b0n0:3347433:3349752 [1] NCCL INFO Using non-device net plugin version 0
x3100c0s5b0n0:3347433:3349752 [1] NCCL INFO Using network Socket
x3100c0s5b0n0:3347432:3349753 [0] NCCL INFO NET/Socket : Using [0]hsn0:10.150.3.12<0> [1]hsn1:10.150.1.122<0> [2]bond0:172.23.1.3<0>
x3100c0s5b0n0:3347432:3349753 [0] NCCL INFO Using non-device net plugin version 0
x3100c0s5b0n0:3347432:3349753 [0] NCCL INFO Using network Socket
x3100c0s5b0n0:3347432:3349753 [0] NCCL INFO comm 0xaf7b270 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 4000 commId 0x9d8f751b9e10c9be - Init START
x3100c0s5b0n0:3347433:3349752 [1] NCCL INFO comm 0xc0884c0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 64000 commId 0x9d8f751b9e10c9be - Init START
x3100c0s5b0n0:3347433:3349752 [1] NCCL INFO Setting affinity for GPU 1 to 01
x3100c0s5b0n0:3347432:3349753 [0] NCCL INFO comm 0xaf7b270 rank 0 nRanks 2 nNodes 1 localRanks 2 localRank 0 MNNVL 0
x3100c0s5b0n0:3347432:3349753 [0] NCCL INFO Channel 00/08 : 0 1
x3100c0s5b0n0:3347432:3349753 [0] NCCL INFO Channel 01/08 : 0 1
x3100c0s5b0n0:3347432:3349753 [0] NCCL INFO Channel 02/08 : 0 1
x3100c0s5b0n0:3347432:3349753 [0] NCCL INFO Channel 03/08 : 0 1
x3100c0s5b0n0:3347432:3349753 [0] NCCL INFO Channel 04/08 : 0 1
x3100c0s5b0n0:3347432:3349753 [0] NCCL INFO Channel 05/08 : 0 1
x3100c0s5b0n0:3347432:3349753 [0] NCCL INFO Channel 06/08 : 0 1
x3100c0s5b0n0:3347432:3349753 [0] NCCL INFO Channel 07/08 : 0 1
x3100c0s5b0n0:3347432:3349753 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] -1/-1/-1->0->1 [3] -1/-1/-1->0->1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] -1/-1/-1->0->1 [7] -1/-1/-1->0->1
x3100c0s5b0n0:3347432:3349753 [0] NCCL INFO P2P Chunksize set to 524288
x3100c0s5b0n0:3347433:3349752 [1] NCCL INFO comm 0xc0884c0 rank 1 nRanks 2 nNodes 1 localRanks 2 localRank 1 MNNVL 0
x3100c0s5b0n0:3347433:3349752 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0 [2] 0/-1/-1->1->-1 [3] 0/-1/-1->1->-1 [4] -1/-1/-1->1->0 [5] -1/-1/-1->1->0 [6] 0/-1/-1->1->-1 [7] 0/-1/-1->1->-1
x3100c0s5b0n0:3347433:3349752 [1] NCCL INFO P2P Chunksize set to 524288
x3100c0s5b0n0:3347433:3349752 [1] NCCL INFO Channel 00/0 : 1[1] -> 0[0] via P2P/CUMEM
x3100c0s5b0n0:3347433:3349752 [1] NCCL INFO Channel 01/0 : 1[1] -> 0[0] via P2P/CUMEM
x3100c0s5b0n0:3347433:3349752 [1] NCCL INFO Channel 02/0 : 1[1] -> 0[0] via P2P/CUMEM
x3100c0s5b0n0:3347433:3349752 [1] NCCL INFO Channel 03/0 : 1[1] -> 0[0] via P2P/CUMEM
x3100c0s5b0n0:3347433:3349752 [1] NCCL INFO Channel 04/0 : 1[1] -> 0[0] via P2P/CUMEM
x3100c0s5b0n0:3347433:3349752 [1] NCCL INFO Channel 05/0 : 1[1] -> 0[0] via P2P/CUMEM
x3100c0s5b0n0:3347433:3349752 [1] NCCL INFO Channel 06/0 : 1[1] -> 0[0] via P2P/CUMEM
x3100c0s5b0n0:3347433:3349752 [1] NCCL INFO Channel 07/0 : 1[1] -> 0[0] via P2P/CUMEM
x3100c0s5b0n0:3347432:3349753 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM
x3100c0s5b0n0:3347432:3349753 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM
x3100c0s5b0n0:3347432:3349753 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM
x3100c0s5b0n0:3347432:3349753 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM
x3100c0s5b0n0:3347432:3349753 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM
x3100c0s5b0n0:3347432:3349753 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM
x3100c0s5b0n0:3347432:3349753 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM
x3100c0s5b0n0:3347432:3349753 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM
x3100c0s5b0n0:3347433:3349752 [1] NCCL INFO Connected all rings
x3100c0s5b0n0:3347433:3349752 [1] NCCL INFO Connected all trees
x3100c0s5b0n0:3347433:3349752 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
x3100c0s5b0n0:3347432:3349753 [0] NCCL INFO Connected all rings
x3100c0s5b0n0:3347432:3349753 [0] NCCL INFO Connected all trees
x3100c0s5b0n0:3347433:3349752 [1] NCCL INFO 8 coll channels, 0 collnet channels, 0 nvls channels, 8 p2p channels, 8 p2p channels per peer
x3100c0s5b0n0:3347432:3349753 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
x3100c0s5b0n0:3347432:3349753 [0] NCCL INFO 8 coll channels, 0 collnet channels, 0 nvls channels, 8 p2p channels, 8 p2p channels per peer
x3100c0s5b0n0:3347433:3349752 [1] NCCL INFO comm 0xc0884c0 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 64000 commId 0x9d8f751b9e10c9be - Init COMPLETE
x3100c0s5b0n0:3347432:3349753 [0] NCCL INFO comm 0xaf7b270 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 4000 commId 0x9d8f751b9e10c9be - Init COMPLETE
/projects/proteinml/.links/miniconda3/envs/metal2/lib/python3.10/site-packages/sklearn/manifold/_t_sne.py:1164: FutureWarning: 'n_iter' was renamed to 'max_iter' in version 1.5 and will be removed in 1.7.
warnings.warn(
20%|βββββββββββββββββββββββββββββ | 4/20 [00:08<00:41, 2.60s/it]/projects/proteinml/.links/miniconda3/envs/metal2/lib/python3.10/site-packages/dvc_render/vega.py:169: UserWarning: `generate_markdown` can only be used with `LinearTemplate`
warn("`generate_markdown` can only be used with `LinearTemplate`") # noqa: B028
45%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 9/20 [00:18<00:13, 1.20s/it]/projects/proteinml/.links/miniconda3/envs/metal2/lib/python3.10/site-packages/dvc_render/vega.py:169: UserWarning: `generate_markdown` can only be used with `LinearTemplate`
warn("`generate_markdown` can only be used with `LinearTemplate`") # noqa: B028
/projects/proteinml/.links/miniconda3/envs/metal2/lib/python3.10/site-packages/sklearn/manifold/_t_sne.py:1164: FutureWarning: 'n_iter' was renamed to 'max_iter' in version 1.5 and will be removed in 1.7.
warnings.warn(
50%|βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ | 10/20 [00:32<00:51, 5.10s/it]
Eventuallyβ¦
Rank 1] Timeout at NCCL work: 456, last enqueued NCCL work: 456, last completed NCCL work: 455.
[rank1]:[E205 14:27:18.083900057 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E205 14:27:18.083904965 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E205 14:27:19.451747711 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=456, OpType=ALLREDUCE, NumelIn=65297, NumelOut=65297, Timeout(ms)=600000) ran for 600054 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f64e3b3af86 in /projects/proteinml/.links/miniconda3/envs/metal/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f64e4e378d2 in /projects/proteinml/.links/miniconda3/envs/metal/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f64e4e3e313 in /projects/proteinml/.links/miniconda3/envs/metal/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f64e4e406fc in /projects/proteinml/.links/miniconda3/envs/metal/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3b65 (0x7f65325d6b65 in /kfs2/projects/proteinml/.links/miniconda3/envs/metal/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x81ca (0x7f653423f1ca in /lib64/libpthread.so.0)
frame #6: clone + 0x43 (0x7f6533721e73 in /lib64/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=456, OpType=ALLREDUCE, NumelIn=65297, NumelOut=65297, Timeout(ms)=600000) ran for 600054 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f64e3b3af86 in /projects/proteinml/.links/miniconda3/envs/metal/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f64e4e378d2 in /projects/proteinml/.links/miniconda3/envs/metal/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f64e4e3e313 in /projects/proteinml/.links/miniconda3/envs/metal/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f64e4e406fc in /projects/proteinml/.links/miniconda3/envs/metal/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3b65 (0x7f65325d6b65 in /kfs2/projects/proteinml/.links/miniconda3/envs/metal/bin/../lib/libstdc++.so.6)
frame #5: <unknown function> + 0x81ca (0x7f653423f1ca in /lib64/libpthread.so.0)
frame #6: clone + 0x43 (0x7f6533721e73 in /lib64/libc.so.6)
Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f64e3b3af86 in /projects/proteinml/.links/miniconda3/envs/metal/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe5aa84 (0x7f64e4ac9a84 in /projects/proteinml/.links/miniconda3/envs/metal/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd3b65 (0x7f65325d6b65 in /kfs2/projects/proteinml/.links/miniconda3/envs/metal/bin/../lib/libstdc++.so.6)
frame #3: <unknown function> + 0x81ca (0x7f653423f1ca in /lib64/libpthread.so.0)
frame #4: clone + 0x43 (0x7f6533721e73 in /lib64/libc.so.6)
W0205 14:27:28.725000 140287485417280 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 1075899 closing signal SIGTERM
E0205 14:27:28.964000 140287485417280 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: -6) local_rank: 1 (pid: 1075900) of binary: /projects/proteinml/.links/miniconda3/envs/metal/bin/python3.10
Traceback (most recent call last):
File "/projects/proteinml/.links/miniconda3/envs/metal/bin/accelerate", line 10, in <module>
sys.exit(main())
File "/projects/proteinml/.links/miniconda3/envs/metal/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/projects/proteinml/.links/miniconda3/envs/metal/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1163, in launch_command
multi_gpu_launcher(args)
File "/projects/proteinml/.links/miniconda3/envs/metal/lib/python3.10/site-packages/accelerate/commands/launch.py", line 792, in multi_gpu_launcher
distrib_run.run(args)
File "/projects/proteinml/.links/miniconda3/envs/metal/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/projects/proteinml/.links/miniconda3/envs/metal/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/projects/proteinml/.links/miniconda3/envs/metal/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
========================================================
pipeline/2.1_self_supervised_training.py FAILED
--------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
--------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-02-05_14:27:28
host : x3100c0s5b0n0.head.cm.kestrel.hpc.nrel.gov
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 1075900)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 1075900
========================================================
The trainer I wrote and am calling is also posted in a comment.
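In the meantime, here is a heavily simplified sketch of the trainer's structure (placeholder names, not the real code): a pretraining evaluation (which does the t-SNE / dvc plotting that shows up as warnings in the log), a block of training steps, a second evaluation, and then the first backward() afterwards, which is where rank 1 times out waiting on the gradient all-reduce.

```python
# Heavily simplified sketch of the trainer structure (placeholder names; the
# real trainer is in a comment). The hang is on the first backward() after the
# second evaluation, i.e. inside DDP's gradient all-reduce.
from accelerate import Accelerator

accelerator = Accelerator()  # picks up the accelerate config below (fp16, 2 processes)

# model, optimizer, train_loader, eval_loader are built elsewhere in the real trainer
model, optimizer, train_loader, eval_loader = accelerator.prepare(
    model, optimizer, train_loader, eval_loader
)

def evaluate():
    model.eval()
    # forward passes over eval_loader, then the t-SNE / dvc_render plotting
    # that produces the warnings in the log
    model.train()

evaluate()                                    # pretraining eval: completes
for step, batch in enumerate(train_loader):   # training steps: complete
    with accelerator.accumulate(model):
        loss = model(**batch).loss
        accelerator.backward(loss)            # after the 2nd eval, rank 1 times out here
        optimizer.step()
        optimizer.zero_grad()
    if (step + 1) % eval_every == 0:
        evaluate()                            # second eval: completes, then the hang
```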
Accelerate config:
compute_environment: LOCAL_MACHINE
debug: true
distributed_type: MULTI_GPU
downcast_bf16: 'no'
enable_cpu_affinity: false
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
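The 600000 ms in the timeout message is the default NCCL collective timeout, so the all-reduce sat for the full 10 minutes. In case it is relevant, the timeout can be raised from the script side with something like this (a sketch; I have not established whether a longer timeout changes anything):

```python
# Sketch: raising the NCCL collective timeout from the default 10 minutes,
# via accelerate's process-group kwargs (values here are illustrative).
from datetime import timedelta
from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

pg_kwargs = InitProcessGroupKwargs(timeout=timedelta(minutes=30))
accelerator = Accelerator(kwargs_handlers=[pg_kwargs])
```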
My environment (loading CUDA 12.1 from elsewhere on my cluster) is in a comment, because I am hitting the max post length.
NCCL tests:
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 1512849 on x3100c0s5b0n0 device 0 [0x04] NVIDIA H100 80GB HBM3
# Rank 1 Group 0 Pid 1512849 on x3100c0s5b0n0 device 1 [0x64] NVIDIA H100 80GB HBM3
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 7.09 0.00 0.00 0 7.33 0.00 0.00 0
16 4 float sum -1 7.40 0.00 0.00 0 7.36 0.00 0.00 0
32 8 float sum -1 7.32 0.00 0.00 0 7.29 0.00 0.00 0
64 16 float sum -1 7.55 0.01 0.01 0 7.32 0.01 0.01 0
128 32 float sum -1 7.40 0.02 0.02 0 7.34 0.02 0.02 0
256 64 float sum -1 7.47 0.03 0.03 0 7.35 0.03 0.03 0
512 128 float sum -1 7.31 0.07 0.07 0 7.26 0.07 0.07 0
1024 256 float sum -1 7.77 0.13 0.13 0 7.56 0.14 0.14 0
2048 512 float sum -1 7.80 0.26 0.26 0 7.69 0.27 0.27 0
4096 1024 float sum -1 8.03 0.51 0.51 0 7.80 0.53 0.53 0
8192 2048 float sum -1 8.36 0.98 0.98 0 8.13 1.01 1.01 0
16384 4096 float sum -1 8.55 1.92 1.92 0 8.26 1.98 1.98 0
32768 8192 float sum -1 8.65 3.79 3.79 0 8.51 3.85 3.85 0
65536 16384 float sum -1 9.02 7.26 7.26 0 8.41 7.79 7.79 0
131072 32768 float sum -1 10.14 12.92 12.92 0 9.77 13.41 13.41 0
262144 65536 float sum -1 12.83 20.43 20.43 0 11.84 22.15 22.15 0
524288 131072 float sum -1 24.62 21.30 21.30 0 25.45 20.60 20.60 0
1048576 262144 float sum -1 28.37 36.96 36.96 0 28.29 37.07 37.07 0
2097152 524288 float sum -1 36.02 58.23 58.23 0 36.02 58.23 58.23 0
4194304 1048576 float sum -1 52.13 80.47 80.47 0 51.97 80.71 80.71 0
8388608 2097152 float sum -1 86.74 96.71 96.71 0 86.53 96.95 96.95 0
16777216 4194304 float sum -1 158.8 105.64 105.64 0 155.5 107.88 107.88 0
33554432 8388608 float sum -1 297.4 112.83 112.83 0 298.1 112.57 112.57 0
67108864 16777216 float sum -1 570.2 117.69 117.69 0 569.1 117.93 117.93 0
134217728 33554432 float sum -1 1109.8 120.94 120.94 0 1107.4 121.20 121.20 0
# Out of bounds values : 0 OK
# A
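One caveat on the nccl-tests run above: with `-g 2` it is a single process driving both GPUs, whereas training runs one process per GPU. For completeness, a two-process all-reduce smoke test closer to the training topology would look like this (a sketch; the file name and launch line are just for illustration):

```python
# smoke_allreduce.py - minimal one-process-per-GPU all-reduce, launched with:
#   torchrun --nproc_per_node=2 smoke_allreduce.py
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")   # torchrun provides rank/world size via env vars
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# roughly the size of the collective that times out in the trace (65297 elements)
x = torch.ones(65297, device=f"cuda:{local_rank}")
dist.all_reduce(x)
torch.cuda.synchronize()
print(f"rank {dist.get_rank()}: all_reduce completed, x[0]={x[0].item()}")

dist.destroy_process_group()
```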
Much appreciation for any guidance. I have been beating my head against this one for a good day and only have so much time allocated to this project.