PyTorch CUDA Initialization is Extremely Slow on A40 GPU (suddenly)

The current workspace is an Ubuntu 20.04 LTS server with an AMD EPYC 7502 and 8 A40s (sm_86), accessed remotely through VMware ESXi.

I was using two of the A40s without any problems under a PyTorch 1.11 / CUDA 11.2 configuration. One of the A40 GPUs was removed today, and afterwards I noticed a severe slowdown in the PyTorch code I was running. I suspected a conflict with the original CUDA version or libraries, so I tried changing the CUDA and PyTorch versions. I copied the settings of other users on the same server exactly, but there was still no improvement.

(Configurations copied from other users on the server)

(1) CUDA 11.3, Pytorch 1.12 (cu113)
(2) CUDA 11.4, Pytorch 1.12 (cu113)
(3) CUDA 11.4, Pytorch 1.7.1

When I run the simple code below

import torch
import time
for i in range(10):
    t = time.time()
    A = torch.rand(1000).cuda()
    B = A*A
    print(time.time()-t)

(1) Other users on the server

0.0005707740783691406
6.961822509765625e-05
0.0001308917999267578
0.00011515617370605469
5.4836273193359375e-05
5.3882598876953125e-05
0.00010442733764648438
0.00010275840759277344
0.00010275840759277344
5.3882598876953125e-05

(2) My Case

67.93046927452087
0.00016880035400390625
3.838539123535156e-05
3.552436828613281e-05
3.4332275390625e-05
3.361701965332031e-05
3.2901763916015625e-05
3.361701965332031e-05
3.409385681152344e-05
3.3855438232421875e-05

(Additionally, import torch.utils.benchmark took more than 30 seconds.)
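To narrow down where that time goes, a breakdown like the following can help separate the module import, the CUDA context creation, and the first kernel launch (a minimal sketch; each step is timed and printed separately):

import time

t = time.time()
import torch                                # loads the CUDA runtime libraries
print("import torch:       {:.2f}s".format(time.time() - t))

t = time.time()
torch.cuda.init()                           # forces the lazy CUDA context creation
torch.cuda.synchronize()
print("context init:       {:.2f}s".format(time.time() - t))

t = time.time()
A = torch.rand(1000, device="cuda")         # first kernel launch on the GPU
torch.cuda.synchronize()
print("first kernel call:  {:.2f}s".format(time.time() - t))

t = time.time()
B = A * A                                   # later launches should take microseconds
torch.cuda.synchronize()
print("second kernel call: {:.2f}s".format(time.time() - t))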

In both of the above settings, numactl and persistence mode are not set, the same as in configuration (3). Perhaps there is a problem loading the CUDA kernels or context.
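One thing that can produce exactly this symptom (stated here only as a hypothesis, not something confirmed in my case) is a PyTorch build that does not ship compiled kernels for sm_86, in which case the driver JIT-compiles the PTX on the first CUDA call and that can take minutes. It can be checked like this:

import torch

# Architectures the installed PyTorch binary was compiled for
print(torch.cuda.get_arch_list())           # should contain 'sm_86' for the A40

# Compute capability of the GPU actually in the machine; the A40 reports (8, 6)
print(torch.cuda.get_device_capability(0))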

Using the example from "Very Slow moving model to device with model.to(device)" (Issue #32192, pytorch/pytorch on GitHub), in my case the setup time is extremely slow.

import torch
import torch.nn as nn
import timeit

print("Beginning..")

t0 = timeit.default_timer()
if torch.cuda.is_available():
    torch.cuda.manual_seed(2809)
    torch.backends.cudnn.deterministic = True
    device = torch.device('cuda:0')
    ngpus = torch.cuda.device_count()
    print("Using {} GPU(s)...".format(ngpus))
print("Setup takes {:.2f}".format(timeit.default_timer()-t0))

t1 = timeit.default_timer()
model = nn.Sequential(
    nn.Conv2d(3, 6, 3, 1, 1),
    nn.ReLU(),
    nn.Conv2d(6, 1, 3, 1, 1)
)
print("Model init takes {:.2f}".format(timeit.default_timer()-t1))


if torch.cuda.is_available():
    t2 = timeit.default_timer()
    model = model.to(device)
    print("Model to device takes {:.2f}".format(timeit.default_timer()-t2))

t3 = timeit.default_timer()
torch.cuda.synchronize()
print("Cuda Synch takes {:.2f}".format(timeit.default_timer()-t3))

print('done')

(2) My Case

Beginning..
Using 1 GPU(s)...
Setup takes 64.12
Model init takes 0.00
Model to device takes 1.73
Cuda Synch takes 0.00
done

What should I check in this situation?
I have already tried almost every relevant os.environ setting (NCCL P2P level, CUDA launch blocking, OMP num threads, …) and none of them helped.
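For reference, this is roughly how those environment variables were tried (the values below are only illustrative; they have to be set before importing torch, and none of them changed the 60+ second delay):

import os

os.environ["NCCL_P2P_LEVEL"] = "NVL"        # illustrative value
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
os.environ["OMP_NUM_THREADS"] = "8"         # illustrative value

import torch                                # the first .cuda() call was still extremely slow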

In the end, the problem was solved by reinstalling CUDA, cuDNN, and the NVIDIA driver. I still don't know what suddenly happened to a previously working conda environment, but I suspect a conflict between the installed drivers was the cause.

First, when I removed the driver with apt-get purge, some packages were left behind, so I had to find and remove them one by one with dpkg -l | grep. To keep the versions compatible with each other, I installed nvidia-driver 510, CUDA 11.6, and cuDNN 8.4.0.27. Installing with the deb packages from the NVIDIA website pulls in the latest driver, so I installed CUDA through the runfile instead; when doing so, the bundled nvidia-driver must not be installed together, and the nvidia-driver matching CUDA 11.6 had to be reinstalled separately. Installing other driver versions (470 and 520) did not solve the original problem.
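After the reinstall, a quick sanity check along these lines (just a sketch to confirm which versions PyTorch actually sees) shows whether everything is picked up correctly:

import torch

print(torch.__version__)                    # PyTorch build
print(torch.version.cuda)                   # CUDA version the wheel was built against (not the system toolkit)
print(torch.backends.cudnn.version())       # bundled cuDNN version as an integer
print(torch.cuda.get_device_name(0))        # should report the A40
print(torch.cuda.get_device_capability(0))  # (8, 6) for sm_86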