GPU startup is way too slow!

Hi,

I’m using NVIDIA RTX A6000 GPUs to run an NLP task with the Transformers library. The problem is that when I try to train in a multi-GPU environment (i.e., dual GPU), the process gets stuck before training starts and makes no progress, as far as the log shows. I’m not sure which part is problematic, since I recently upgraded my GPU cluster. I once left it running to see whether training would eventually progress, but after 5–6 hours with no change I stopped the script.

The following is the system and package info:

NVIDIA-SMI 510.60.02
Driver Version: 510.60.02
CUDA Version: 11.6
PyTorch: 1.8.2
Arch version: SM_75, compute_75

Based on your output I assume you’ve built PyTorch from source for Turing GPUs (architectures sm_75).
If so, add sm_80 and sm_86 to your build and it should work on your Ampere GPU.
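
For reference, if you do end up building from source, the target architectures are usually selected via the TORCH_CUDA_ARCH_LIST environment variable before running the build (a sketch, assuming a PyTorch source checkout):

code:
TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6" python3 setup.py install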

Thanks for your answer. I’m actually not installing PyTorch from source. Do I need to build it from source to enable this? If so, how can I add sm_80 and sm_86 to my build? I couldn’t find any documentation on this.

No, you don’t need to build from source to enable your Ampere GPUs, and I’m not sure how to interpret

Arch version: SM_75, compute_75

in this case.
The 1.8.2+cu111 binaries are built for your compute capability and work fine in my setup (using the same GPU architecture):

>>> import torch
>>> torch.__version__
'1.8.2+cu111'
>>> torch.cuda.get_arch_list()
['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86']
>>> torch.randn(1).cuda()
tensor([1.5997], device='cuda:0')
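
You could also double-check the device’s compute capability directly; an RTX A6000 is an Ampere GPU and should report (8, 6), i.e. sm_86:

>>> torch.cuda.get_device_capability(0)
(8, 6)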

Thanks, the code you provided works fine on my machine as well. I just checked and it includes sm_80 and sm_86 as well. The problem, though, still exists in the multi-GPU environment. Here are the running logs of my code with the Transformers library.

code:
python3 -m torch.distributed.launch --nproc_per_node=2 training_script.py --[args]

logs:

[INFO|modeling_utils.py:1431] 2022-04-03 10:15:19,610 >> loading weights file https://huggingface.co/allenai/led-base-16384/resolve/main/pytorch_model.bin from cache at /home/sajad/.cache/huggingface/transformers/c8f7e4603efbc329ce921b34057d78880dead50f45b2a1648b3a06ca6eb17f51.201222b06d46289037a8dccc57548abc8eb81ba042d3762214ac15c9691ff8c7
[INFO|modeling_utils.py:1698] 2022-04-03 10:15:21,212 >> All model checkpoint weights were used when initializing LEDForConditionalGeneration.

[INFO|modeling_utils.py:1706] 2022-04-03 10:15:21,213 >> All the weights of LEDForConditionalGeneration were initialized from the model checkpoint at allenai/led-base-16384.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LEDForConditionalGeneration for predictions without further training.
04/03/2022 10:15:21 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /home/sajad/.cache/huggingface/datasets/json/default-528e7879b13129b0/0.0.0/ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b/cache-4067e49ab36511c2.arrow
chianti:24372:24372 [0] NCCL INFO Bootstrap : Using enp8s0:192.168.10.202<0>
chianti:24372:24372 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

chianti:24372:24372 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
chianti:24372:24372 [0] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.10.202<0>
chianti:24372:24372 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.1
chianti:24373:24373 [1] NCCL INFO Bootstrap : Using enp8s0:192.168.10.202<0>
chianti:24373:24373 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

chianti:24373:24373 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
chianti:24373:24373 [1] NCCL INFO NET/Socket : Using [0]enp8s0:192.168.10.202<0>
chianti:24373:24373 [1] NCCL INFO Using network Socket
chianti:24373:24420 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
chianti:24372:24419 [0] NCCL INFO Channel 00/02 : 0 1
chianti:24372:24419 [0] NCCL INFO Channel 01/02 : 0 1
chianti:24372:24419 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
chianti:24372:24419 [0] NCCL INFO Setting affinity for GPU 0 to ffffff
chianti:24372:24419 [0] NCCL INFO Channel 00 : 0[a000] -> 1[42000] via P2P/IPC
chianti:24373:24420 [1] NCCL INFO Channel 00 : 1[42000] -> 0[a000] via P2P/IPC
chianti:24372:24419 [0] NCCL INFO Channel 01 : 0[a000] -> 1[42000] via P2P/IPC
chianti:24373:24420 [1] NCCL INFO Channel 01 : 1[42000] -> 0[a000] via P2P/IPC
chianti:24373:24420 [1] NCCL INFO Connected all rings
chianti:24373:24420 [1] NCCL INFO Connected all trees
chianti:24373:24420 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
chianti:24373:24420 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
chianti:24372:24419 [0] NCCL INFO Connected all rings
chianti:24372:24419 [0] NCCL INFO Connected all trees
chianti:24372:24419 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
chianti:24372:24419 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
chianti:24373:24420 [1] NCCL INFO comm 0x7ff580002f70 rank 1 nranks 2 cudaDev 1 busId 42000 - Init COMPLETE
chianti:24372:24419 [0] NCCL INFO comm 0x7f4eb4002f70 rank 0 nranks 2 cudaDev 0 busId a000 - Init COMPLETE
chianti:24372:24372 [0] NCCL INFO Launch mode Parallel
[E ProcessGroupNCCL.cpp:719] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1804805 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:719] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1804805 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1804805 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:406] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1804805 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 24372) of binary: /home/sajad/anaconda3/envs/myenv/bin/python3
Traceback (most recent call last):
  File "/home/sajad/anaconda3/envs/myenv/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/sajad/anaconda3/envs/myenv/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/sajad/anaconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/sajad/anaconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/sajad/anaconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/sajad/anaconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/run.py", line 715, in run
    elastic_launch(
  File "/home/sajad/anaconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/sajad/anaconda3/envs/myenv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

examples/pytorch/summarization/run_summarization.py FAILED

Failures:
[1]:
time : 2022-04-03_10:45:32
host : chianti
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 24373)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 24373

Root Cause (first observed failure):
[0]:
time : 2022-04-03_10:45:32
host : chianti
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 24372)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 24372

Searching the internet, I feel my problem relates to this issue: A6000 Docker Ligthning ddp NCCL WARN Failed to open libibverbs.so · Issue #73790 · pytorch/pytorch · GitHub. It does seem relevant, as I’m also hitting this in a multi-GPU environment with A6000 GPUs. As it stands, the workaround there is to disable IOMMU, since it locks up the system when used with multiple GPUs. I will try this solution and report the result here. Meanwhile, do you have any other ideas about why this is happening?
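
For reference, here is a rough way I plan to check from the OS whether the IOMMU is currently active (a sketch, assuming a Linux host; the exact sysfs entries may vary by platform):

code:
# rough IOMMU status check (a sketch; paths are standard on Linux but may differ per platform)
import os

# kernel boot parameters, e.g. "intel_iommu=on", "amd_iommu=on", "iommu=off"
print("cmdline:", open("/proc/cmdline").read().strip())

# registered IOMMUs show up here when the IOMMU is enabled
iommus = os.listdir("/sys/class/iommu") if os.path.isdir("/sys/class/iommu") else []
print("IOMMU devices:", iommus or "none found")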

Your original description doesn’t really fit the linked issue: you mentioned that your “startup is way too slow” (so I assumed your script works after a while but just needs a “warmup” time for unknown reasons), while the linked issue describes a hang, where the code never starts to execute.
If you are indeed hitting the second issue, then yes: disable IOMMU and rerun the code.
Also, is your code working fine in a single-GPU use case?
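
As an additional check, you could run a minimal all_reduce script outside of Transformers to see whether a plain NCCL collective already hangs (a minimal sketch; nccl_check.py is a hypothetical filename, launched the same way with --nproc_per_node=2):

code:
# nccl_check.py - minimal NCCL sanity check (a sketch)
# launch: python3 -m torch.distributed.launch --nproc_per_node=2 nccl_check.py
import argparse
import os
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
# torch.distributed.launch passes --local_rank; newer launchers set the LOCAL_RANK env var instead
parser.add_argument("--local_rank", type=int, default=int(os.environ.get("LOCAL_RANK", 0)))
args = parser.parse_args()

dist.init_process_group(backend="nccl")
torch.cuda.set_device(args.local_rank)

# a single all_reduce across both GPUs; if this already hangs,
# the problem is in NCCL/P2P communication rather than in the training script
x = torch.ones(1, device="cuda")
dist.all_reduce(x)
print(f"rank {dist.get_rank()}: all_reduce result = {x.item()}")

dist.destroy_process_group()

If this small script hangs as well, the issue is in the GPU-to-GPU (P2P) communication path rather than in your training code.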

Yes, as I mentioned in the question, the training code gets stuck and never starts training; it doesn’t even exit the program with any error code, which I assume is what the “hanging” refers to. My code works perfectly fine in the single-GPU case; the only problem I have is with multi-GPU runs. Just to confirm my understanding: I should disable IOMMU in the BIOS, right?

@ptrblck Thank you for your kind suggestion. I could finally get it fixed by disabling IOMMU in the BIOS.

Great, thanks for the update!

Hello, I have the same problem, but my platform is a cluster rather than a personal computer, so I can’t disable IOMMU. Are there other ways to work around this? The machine has eight A100 GPUs.