First call in torch session to torch.cuda.init() takes > 80 seconds

import torch
import time

a = time.time()
torch.cuda.init()
print(time.time()-a)

I am having issues with CUDA. The first CUDA call in any PyTorch session takes almost 80 seconds. During this time, nvidia-smi also hangs, and other processes using CUDA freeze as well. Any insight into how to debug this problem? I even reinstalled the driver, but that didn’t help.

Specs: 8x NVIDIA RTX A6000 GPUs, 512 GB RAM, Intel Xeon CPU, torch 2.1.0+cu121

Thanks in advance!

Check if any CUDA hit processes are launched during this “hang”. This should not happen with any of our binaries, so did you install PyTorch from another source?

Sorry, I don’t know how to check whether CUDA hit processes are launched during the hang. But I also tested with nccl-tests and ran this command:

spatel@vulcan:~/nccl-tests$ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1

and it hangs after this

# nThread 1 nGpus 1 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices

for 80-90 seconds before proceeding further:

#  Rank  0 Group  0 Pid 210482 on     vulcan device  0 [0x4f] NVIDIA RTX A6000
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum      -1     4.52    0.00    0.00      0     0.13    0.06    0.00      0
          16             4     float     sum      -1     3.90    0.00    0.00      0     0.13    0.12    0.00      0
          32             8     float     sum      -1     4.13    0.01    0.00      0     0.13    0.25    0.00      0
          64            16     float     sum      -1     4.04    0.02    0.00      0     0.13    0.51    0.00      0
         128            32     float     sum      -1     4.00    0.03    0.00      0     0.13    1.02    0.00      0
         256            64     float     sum      -1     4.33    0.06    0.00      0     0.12    2.08    0.00      0
         512           128     float     sum      -1     3.89    0.13    0.00      0     0.12    4.10    0.00      0
        1024           256     float     sum      -1     3.96    0.26    0.00      0     0.13    8.10    0.00      0
        2048           512     float     sum      -1     3.84    0.53    0.00      0     0.13   16.15    0.00      0
        4096          1024     float     sum      -1     4.21    0.97    0.00      0     0.13   32.18    0.00      0
        8192          2048     float     sum      -1     4.16    1.97    0.00      0     0.12   65.75    0.00      0
       16384          4096     float     sum      -1     3.92    4.18    0.00      0     0.12  132.61    0.00      0
       32768          8192     float     sum      -1     4.21    7.78    0.00      0     0.12  264.58    0.00      0
       65536         16384     float     sum      -1     3.78   17.34    0.00      0     0.12  527.03    0.00      0
      131072         32768     float     sum      -1     4.00   32.76    0.00      0     0.13  1036.96    0.00      0
      262144         65536     float     sum      -1     4.17   62.91    0.00      0     0.12  2114.06    0.00      0
      524288        131072     float     sum      -1     3.92  133.78    0.00      0     0.13  4180.93    0.00      0
     1048576        262144     float     sum      -1     5.36  195.72    0.00      0     0.14  7679.06    0.00      0
     2097152        524288     float     sum      -1     8.74  240.01    0.00      0     0.12  17168.66    0.00      0
     4194304       1048576     float     sum      -1    15.25  274.96    0.00      0     0.13  32152.58    0.00      0
     8388608       2097152     float     sum      -1    28.46  294.78    0.00      0     0.12  67432.54    0.00      0
    16777216       4194304     float     sum      -1    53.01  316.51    0.00      0     0.12  134973.58    0.00      0
    33554432       8388608     float     sum      -1    102.1  328.76    0.00      0     0.13  258111.02    0.00      0
    67108864      16777216     float     sum      -1    199.9  335.65    0.00      0     0.13  535158.41    0.00      0
   134217728      33554432     float     sum      -1    396.5  338.51    0.00      0     0.12  1078920.64    0.00      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0 

I don’t think there is anything wrong with the PyTorch binaries, but I suspect something is wrong with CUDA or maybe at the hardware level. I would be very grateful if you could guide me to a proper way to debug this sort of problem.

Thanks!

Sorry, autocorrect butchered it to “hit” while I meant CUDA JIT.
Run top (or any other process-monitoring tool) during the nccl-test and see if anything related to CUDA’s JIT is launched.
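For example, polling ps in a loop while the benchmark runs in another terminal would catch short-lived compiler helpers (the process names below are my guess at likely JIT-related binaries and may differ by CUDA version):

```shell
# Poll for known CUDA compiler/JIT helper binaries (ptxas, cicc, nvcc)
# a few times; print a note on each round where none are found.
for i in $(seq 1 10); do
    ps -eo comm= | grep -iE 'ptxas|cicc|nvcc' || echo "round $i: no JIT helpers seen"
    sleep 0.5
done
```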

Thank you again for the reply. I checked using top, htop, and ps and found nothing related to CUDA JIT, but I observed that ./build/all_reduce_perf used 100% of the CPU during the hang.

Thanks again!
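Since nvidia-smi hangs as well, one quick check for a 100%-CPU process like this is whether it is spinning in user space or stuck inside the kernel (e.g. in a driver ioctl). On Linux, /proc/&lt;pid&gt;/wchan names the kernel function a blocked task is waiting in; a sketch using a dummy sleep in place of the real all_reduce_perf PID:

```shell
# Start a dummy process to inspect; for the real hang, substitute the PID
# of all_reduce_perf (from ps or top).
sleep 30 &
pid=$!
# A kernel-blocked task shows a wait-channel function name here; a process
# busy-spinning in user space shows "0" instead.
cat "/proc/$pid/wchan"; echo
kill "$pid"
```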

Do you see this slow startup time only when NCCL is used or in any other CUDA application, too?
E.g. could you try any of the CUDA samples?

Yes, it also hangs on normal CUDA operations. I ran this test:

spatel@vulcan:~/cuda-samples/bin/x86_64/linux/release$ ./LargeKernelParameter 

It took around 75-80 seconds before producing this output:

spatel@vulcan:~/cuda-samples/bin/x86_64/linux/release$ ./LargeKernelParameter 
Kernel 4KB parameter limit - time (us):128.908
Kernel 32,764 byte parameter limit - time (us):52.8456
Test passed!

Could you post more information about your setup, i.e. which CUDA driver, OS, etc. you are using?

Sure

OS

Distributor ID:	Ubuntu
Description:	Ubuntu 22.04.3 LTS
Release:	22.04
Codename:	jammy

CUDA Compiler (if it matters)

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Sep__8_19:17:24_PDT_2023
Cuda compilation tools, release 12.3, V12.3.52
Build cuda_12.3.r12.3/compiler.33281558_0

NVIDIA-SMI

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               On  | 00000000:4F:00.0 Off |                  Off |
| 30%   28C    P8              27W / 300W |      6MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000               On  | 00000000:52:00.0 Off |                  Off |
| 30%   28C    P8              20W / 300W |      6MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A6000               On  | 00000000:56:00.0 Off |                  Off |
| 30%   26C    P8              20W / 300W |      6MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A6000               On  | 00000000:57:00.0 Off |                  Off |
| 30%   57C    P2              89W / 300W |   2901MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA RTX A6000               On  | 00000000:CE:00.0 Off |                  Off |
| 30%   27C    P8              25W / 300W |      6MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA RTX A6000               On  | 00000000:D1:00.0 Off |                  Off |
| 30%   28C    P8              25W / 300W |      3MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA RTX A6000               On  | 00000000:D5:00.0 Off |                  Off |
| 30%   29C    P8              23W / 300W |      3MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA RTX A6000               On  | 00000000:D6:00.0 Off |                  Off |
| 30%   30C    P8              24W / 300W |      3MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

CPU and RAM info

Intel(R) Xeon(R) Gold 5320 CPU @ 2.20GHz

MemTotal:       527977556 kB
MemFree:        468947204 kB
MemAvailable:   512659408 kB
Buffers:         2773348 kB
Cached:         40899460 kB
SwapCached:            0 kB
Active:          9509664 kB
Inactive:       43019600 kB
Active(anon):       6988 kB
Inactive(anon):  8905892 kB
Active(file):    9502676 kB
Inactive(file): 34113708 kB
Unevictable:       27840 kB
Mlocked:           27840 kB
SwapTotal:       8388604 kB
SwapFree:        8388604 kB
Dirty:               804 kB
Writeback:             0 kB
AnonPages:       8854292 kB
Mapped:          1485956 kB
Shmem:             47352 kB
KReclaimable:    3521812 kB
Slab:            4354572 kB
SReclaimable:    3521812 kB
SUnreclaim:       832760 kB
KernelStack:       29056 kB
PageTables:        98688 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    272377380 kB
Committed_AS:   35776072 kB
VmallocTotal:   13743895347199 kB
VmallocUsed:      508048 kB
VmallocChunk:          0 kB
Percpu:           107072 kB
HardwareCorrupted:     0 kB
AnonHugePages:    694272 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:     6265256 kB
DirectMap2M:    75167744 kB
DirectMap1G:    457179136 kB

Thank you!
Could you also check if any PyTorch CUDA call is showing the issue (I would assume so):

python -c "import torch; torch.randn(1).cuda()"

So far I cannot reproduce the slow execution using 2.1.0+cu121 on an RTX A6000 using driver 535.104.05.

No issues, just the hang before the execution finishes. I think something is wrong at the system or hardware level; I am in contact with the manufacturers. Is there any other way to get DEBUG-level logging for CUDA or something similar?

You could check dmesg for any Xids and report them here.

I executed this command:

sudo dmesg | grep Xid

and got this output:

[ 6835.488541] NVRM: Xid (PCI:0000:4f:00): 43, pid=18343, name=python3, Ch 00000008

Could you check the actual date of this Xid and see if it can be related, or whether it was raised on another day?
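The bracketed number in the dmesg line is seconds since boot, so it can be converted to a wall-clock date (a sketch assuming GNU date and Linux’s /proc/uptime; sudo dmesg -T would also print human-readable timestamps directly):

```shell
# Boot time = current time minus uptime; the Xid's wall-clock time is then
# boot time plus the bracketed dmesg offset (6835 s in the line above).
boot=$(( $(date +%s) - $(cut -d. -f1 /proc/uptime) ))
date -d "@$(( boot + 6835 ))"
```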

I had to restart the server and can’t find any Xid error now. :frowning_face:

If no new Xids are raised, the original one should be unrelated.