The current workspace is an Ubuntu 20.04 LTS server with an AMD EPYC 7502 and 8 A40s (sm_86), accessed remotely through VMware ESXi.
I had been using two of the A40s without any problems under a PyTorch 1.11 / CUDA 11.2 configuration. One of the A40 GPUs was removed today, and afterwards I noticed a slowdown in the PyTorch code I was running. I suspected a conflict with the original CUDA version or its libraries, so I tried changing both the CUDA version and the PyTorch version. I copied the settings of other users on the same server exactly, but there was still no improvement.
(Settings copied from other users on the server:)
(1) CUDA 11.3, PyTorch 1.12 (cu113)
(2) CUDA 11.4, PyTorch 1.12 (cu113)
(3) CUDA 11.4, PyTorch 1.7.1
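For reference, a minimal check to confirm which versions a given environment actually picks up:

import torch

# Report the versions and devices PyTorch actually sees in this environment.
print("PyTorch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("GPUs visible:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i), torch.cuda.get_device_capability(i))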
When I run the simple code below, I get these per-iteration timings (in seconds):
import torch
import time

for i in range(10):
    t = time.time()
    A = torch.rand(1000).cuda()
    B = A * A
    print(time.time() - t)
(1) Other users on the server
0.0005707740783691406
6.961822509765625e-05
0.0001308917999267578
0.00011515617370605469
5.4836273193359375e-05
5.3882598876953125e-05
0.00010442733764648438
0.00010275840759277344
0.00010275840759277344
5.3882598876953125e-05
(2) My Case
67.93046927452087
0.00016880035400390625
3.838539123535156e-05
3.552436828613281e-05
3.4332275390625e-05
3.361701965332031e-05
3.2901763916015625e-05
3.361701965332031e-05
3.409385681152344e-05
3.3855438232421875e-05
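Only the very first CUDA touch is slow, so I assume the cost is in driver/context initialization rather than in the multiply itself. A minimal sketch to separate the two (torch.cuda.init() just forces PyTorch's lazy CUDA initialization):

import time
import torch

t = time.time()
torch.cuda.init()          # force lazy CUDA/driver initialization
torch.cuda.synchronize()
print("context init:", time.time() - t)

t = time.time()
A = torch.rand(1000, device="cuda")
B = A * A
torch.cuda.synchronize()   # wait for the kernel before reading the clock
print("first op:", time.time() - t)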
(Additionally, just running import torch.utils.benchmark took more than 30 seconds.)
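To see whether that time is spent in the Python import itself rather than in the first CUDA call, CPython's -X importtime gives a per-module breakdown; a small wrapper around it (the module name is just the one from my test):

import subprocess, sys

# -X importtime prints cumulative import times per module to stderr;
# the slowest (outermost) entries appear near the end of the output.
res = subprocess.run(
    [sys.executable, "-X", "importtime", "-c", "import torch.utils.benchmark"],
    capture_output=True, text=True,
)
print("\n".join(res.stderr.splitlines()[-10:]))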
In both of the cases above, neither numactl nor persistence mode is set, matching configuration (3). Perhaps there is a problem loading the CUDA kernels or creating the CUDA context.
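On that note, when persistence mode is off the driver tears down its state once the last client exits, so every new process pays the full re-initialization cost. A quick way to check it per GPU (assuming nvidia-smi is on PATH; enabling it requires root):

import subprocess

# Query persistence mode per GPU; "Disabled" means driver state is
# torn down whenever no client is attached.
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=index,name,persistence_mode", "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(out.stdout)
# To enable (as root): nvidia-smi -pm 1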
Using the example from "Very slow moving model to device with model.to(device)" (pytorch/pytorch issue #32192 on GitHub), in my case the setup time is extremely slow.
import torch
import torch.nn as nn
import timeit

print("Beginning..")

t0 = timeit.default_timer()
if torch.cuda.is_available():
    torch.cuda.manual_seed(2809)
    torch.backends.cudnn.deterministic = True
    device = torch.device('cuda:0')
    ngpus = torch.cuda.device_count()
    print("Using {} GPU(s)...".format(ngpus))
print("Setup takes {:.2f}".format(timeit.default_timer() - t0))

t1 = timeit.default_timer()
model = nn.Sequential(
    nn.Conv2d(3, 6, 3, 1, 1),
    nn.ReLU(),
    nn.Conv2d(6, 1, 3, 1, 1)
)
print("Model init takes {:.2f}".format(timeit.default_timer() - t1))

if torch.cuda.is_available():
    t2 = timeit.default_timer()
    model = model.to(device)
    print("Model to device takes {:.2f}".format(timeit.default_timer() - t2))

    t3 = timeit.default_timer()
    torch.cuda.synchronize()
    print("Cuda Synch takes {:.2f}".format(timeit.default_timer() - t3))

print('done')
(2) My Case
Beginning..
Using 1 GPU(s)...
Setup takes 64.12
Model init takes 0.00
Model to device takes 1.73
Cuda Synch takes 0.00
done
What should I check in this situation? Setting almost every relevant os.environ variable (NCCL_P2P_LEVEL / CUDA_LAUNCH_BLOCKING / OMP_NUM_THREADS / …) has been useless.
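For completeness, this is the pattern I used (the specific values here are only examples); everything is set before importing torch, and none of it changed the timing:

import os

# Must be set before `import torch` to take effect.
os.environ["CUDA_LAUNCH_BLOCKING"] = "0"
os.environ["OMP_NUM_THREADS"] = "8"
os.environ["NCCL_P2P_LEVEL"] = "NVL"

import torch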