Two identical CUDA devices (RTX 3090) are available. Training on the first device is up to 2x faster than on the second. Only after calling torch.cuda.set_device(1) does training on the second device run as fast as on the first.
Could you share a minimal and executable code snippet showing this behavior, please?
Also, the output of python -m torch.utils.collect_env would be needed.
If possible, could you also share Nsight Systems profiles?
Here is the Nsight Systems report
https://fileport.io/ETHXKZUhrXTC
Here is the output of python -m torch.utils.collect_env:
PyTorch version: 1.12.1+cu116
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A
OS: Microsoft Windows 10 Pro
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A
Python version: 3.9.13 (tags/v3.9.13:6de2ca5, May 17 2022, 16:36:42) [MSC v.1929 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.19044-SP0
Nvidia driver version: 517.40
cuDNN version: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.6\bin\cudnn_ops_train64_8.dll
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.24.1
[pip3] pytorch-lightning==1.8.3.post1
[pip3] torch==1.12.1+cu116
[pip3] torch-summary==1.4.5
[pip3] torchmetrics==0.11.0
[pip3] torchsummary==1.5.1
[pip3] torchvision==0.13.0+cu116
[conda] Could not collect
Here is the code I was using
import time

import torch
from tqdm import tqdm

devices = ["cuda:0", "cuda:1"]
torch.random.manual_seed(0)


def network_train(device):
    m = torch.nn.Sequential(
        torch.nn.Linear(100, 100),
        torch.nn.Linear(100, 100),
        torch.nn.Linear(100, 100),
        torch.nn.Linear(100, 100),
        torch.nn.Linear(100, 100),
        torch.nn.Linear(100, 100),
        torch.nn.Linear(100, 100),
        torch.nn.Linear(100, 100),
        torch.nn.Linear(100, 100),
        torch.nn.Linear(100, 100),
        torch.nn.Linear(100, 100),
        torch.nn.Linear(100, 100),
        torch.nn.Linear(100, 100),
        torch.nn.Linear(100, 100),
        torch.nn.Linear(100, 100),
        torch.nn.Linear(100, 100),
        torch.nn.Linear(100, 10),
        torch.nn.Linear(10, 1),
    )
    m.to(device)
    x = torch.randn(1000, 100)
    y = torch.sum(torch.square(x), dim=0)
    y = torch.unsqueeze(y, 0)
    x = x.to(device)
    y = y.to(device)
    loss_fn = torch.nn.MSELoss(reduction='mean')
    opt = torch.optim.Adam(m.parameters(), lr=0.001)
    t = time.time()
    for _ in tqdm(range(1000)):
        opt.zero_grad()
        outputs = m(x)
        loss = loss_fn(outputs, y)
        loss.backward()
        opt.step()
    print(time.time() - t, device)


for dev in devices:
    network_train(dev)
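A note on measuring this kind of loop: CUDA kernels are launched asynchronously, so reading the wall clock right after the loop may not include work still queued on the GPU. The following is a minimal sketch of a timing helper that synchronizes before starting and stopping the clock. The helper name timed_steps, the reduced two-layer model, and the pin_device flag are assumptions made for illustration and are not part of the script above; pin_device simply applies the torch.cuda.set_device call mentioned in the first post so both variants can be compared. It falls back to the CPU when no GPU is present.

```python
import time

import torch


def timed_steps(device, steps=100, pin_device=False):
    """Time `steps` Adam updates on `device`, synchronizing around the clock."""
    dev = torch.device(device)
    use_cuda = dev.type == "cuda"
    if use_cuda and pin_device:
        # Workaround reported in the first post: make `dev` the current device.
        torch.cuda.set_device(dev)

    model = torch.nn.Sequential(
        torch.nn.Linear(100, 100),
        torch.nn.Linear(100, 1),
    ).to(dev)
    x = torch.randn(1000, 100, device=dev)
    y = torch.randn(1000, 1, device=dev)
    loss_fn = torch.nn.MSELoss()
    opt = torch.optim.Adam(model.parameters(), lr=0.001)

    if use_cuda:
        torch.cuda.synchronize(dev)  # finish pending work before starting
    start = time.perf_counter()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    if use_cuda:
        torch.cuda.synchronize(dev)  # wait for queued kernels before stopping
    return time.perf_counter() - start


if __name__ == "__main__":
    names = [f"cuda:{i}" for i in range(torch.cuda.device_count())] or ["cpu"]
    for name in names:
        for pin in (False, True):
            print(name, "pinned" if pin else "not pinned", timed_steps(name, pin_device=pin))
```

Running each device with and without pinning in the same process should make it easier to see whether the slowdown follows the device index or the current-device setting.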
Thanks for sharing the information!
Unfortunately, I’m not a Windows or Gloo expert (I assume you are using Gloo, as NCCL is not supported on Windows), so I might not know what is causing the issue. Additionally, I cannot open your shared Nsight Systems profile, so I cannot see the timeline.
The problem occurs only with the Adam optimizer; there is no such problem with SGD. Moreover, a performance profiler showed that the addcdiv_ function in _TensorBase.py was the cause of the problem. By the way, Gloo is used since NCCL is not available.
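One way to compare the two optimizers at the operator level is torch.profiler. This is only a sketch under assumptions: the post does not say which profiler was used, and the helper name profile_optimizer_step and the single-layer model are hypothetical. It profiles a few steps for a given optimizer class and returns the averaged events, which can then be printed as a table sorted by CPU time.

```python
import torch
from torch.profiler import ProfilerActivity, profile


def profile_optimizer_step(opt_cls, device="cpu", steps=5):
    """Profile a few training steps and return the averaged profiler events."""
    dev = torch.device(device)
    model = torch.nn.Linear(100, 100).to(dev)
    x = torch.randn(64, 100, device=dev)
    opt = opt_cls(model.parameters(), lr=0.001)

    activities = [ProfilerActivity.CPU]
    if dev.type == "cuda":
        activities.append(ProfilerActivity.CUDA)

    with profile(activities=activities) as prof:
        for _ in range(steps):
            opt.zero_grad()
            model(x).sum().backward()
            opt.step()
    return prof.key_averages()


if __name__ == "__main__":
    # Use the second GPU when there is one; otherwise fall back to the CPU.
    device = "cuda:1" if torch.cuda.device_count() > 1 else "cpu"
    for opt_cls in (torch.optim.Adam, torch.optim.SGD):
        events = profile_optimizer_step(opt_cls, device)
        print(opt_cls.__name__)
        print(events.table(sort_by="cpu_time_total", row_limit=10))
```

Comparing the Adam and SGD tables for the same device shows which operators account for the extra time.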
Here is the analysis summary.
Profiling session duration: 00:44.836
Report file C:/Users/aibotics/Documents/NVIDIA Nsight Systems/Projects/Project 1/Report 3.nsys-rep
Report size 108.83 MiB
Report capture time 3/7/2023, 3:41:46 PM
Number of events collected 4.863.270
Total number of threads 1.202
Host computer DESKTOP-NOTA95K
Profiling stop reason Stopped when last profiled process exited
DESKTOP-NOTA95K (0:0)
Target
Target name DESKTOP-NOTA95K
Local time at t=0 2023-03-07T15:39:56.927+01:00
UTC time at t=0 2023-03-07T14:39:56.927Z
TSC value at t=0 696180882860583
OS Windows 10.0.19044
Hardware platform x86_64
Serial number Local
CPU description AMD EPYC 7443P 24-Core Processor
GPU descriptions NVIDIA GeForce RTX 3090
NVIDIA GeForce RTX 3090
Microsoft Basic Render Driver
NVIDIA driver version 517.40
CPU context switch not supported
GPU context switch supported
Tunnel traffic through SSH no
Timestamp counter supported
Process summary
Process ID Name Arguments CPU utilization
12772 python.exe 72,39%
4 System 22,73%
1688 dwm.exe 1,82%
5448 Explorer.EXE 1,23%
18220 conhost.exe 0,42%
1144 csrss.exe 0,34%
14528 msedge.exe 0,31%
1356 msedge.exe 0,25%
16132 chrome.exe 0,16%
6768 cmd.exe 0,13%
15288 chrome.exe 0,12%
2580 MemCompression 0,11%
Information about 2 processes with CPU utilization below 0,10% has been hidden.
Module summary
Process ID Module name Address CPU time (overall)
12772 C:\Windows\System32\win32u.dll [unknown]
1688 C:\Windows\System32\ntdll.dll [unknown]
12772 C:\Windows\System32\ntdll.dll [unknown]
12772 C:\Users\aibotics\AppData\Local\Programs\Python\Python39\Lib\site-packages\torch\lib\libiomp5md.dll [unknown]
12772 C:\Windows\System32\DriverStore\FileRepository\nv_dispsig.inf_amd64_9751f6c72c23c322\nvcuda64.dll [unknown]
12772 C:\Users\aibotics\AppData\Local\Programs\Python\Python39\Lib\site-packages\torch\lib\torch_cpu.dll [unknown]
1688 C:\Windows\System32\win32u.dll [unknown]
1144 C:\Windows\System32\win32u.dll [unknown]
5448 C:\Windows\System32\win32u.dll [unknown]
14528 [Unknown] [unknown]
12772 C:\Users\aibotics\AppData\Local\Programs\Python\Python39\Lib\site-packages\torch\lib\c10.dll [unknown]
12772 C:\Users\aibotics\AppData\Local\Programs\Python\Python39\python39.dll [unknown]
1356 [Unknown] [unknown]
12772 C:\Users\aibotics\AppData\Local\Programs\Python\Python39\Lib\site-packages\torch\lib\cublasLt64_11.dll [unknown]
12772 C:\Windows\System32\KernelBase.dll [unknown]
12772 C:\Users\aibotics\AppData\Local\Programs\Python\Python39\Lib\site-packages\torch\lib\torch_cuda_cu.dll [unknown]
5448 C:\Windows\System32\ntdll.dll [unknown]
18220 C:\Windows\System32\win32u.dll [unknown]
18220 C:\Windows\System32\ntdll.dll [unknown]
16132 [Unknown] [unknown]
12772 C:\Users\aibotics\AppData\Local\Programs\Python\Python39\Lib\site-packages\torch\lib\c10_cuda.dll [unknown]
12772 C:\Users\aibotics\AppData\Local\Programs\Python\Python39\Lib\site-packages\torch\lib\torch_python.dll [unknown]
12772 C:\Windows\System32\msvcp140.dll [unknown]
15288 [Unknown] [unknown]
12772 C:\Windows\System32\ucrtbase.dll [unknown]
6768 C:\Windows\System32\ntdll.dll [unknown]
1688 C:\Windows\System32\dwmcore.dll [unknown]
12772 C:\Users\aibotics\AppData\Local\Programs\Python\Python39\vcruntime140.dll [unknown]
12772 C:\Users\aibotics\AppData\Local\Programs\Python\Python39\Lib\site-packages\torch\lib\cublas64_11.dll [unknown]
1688 C:\Windows\System32\KernelBase.dll [unknown]
5448 C:\Windows\System32\KernelBase.dll [unknown]
12772 C:\Windows\System32\kernel32.dll [unknown]
1688 C:\Windows\System32\DriverStore\FileRepository\nv_dispsig.inf_amd64_9751f6c72c23c322\nvwgf2umx_cfg.dll [unknown]
1144 C:\Windows\System32\ntdll.dll [unknown]
12772 C:\Users\aibotics\AppData\Local\Programs\Python\Python39\Lib\site-packages\torch\lib\torch_cuda_cpp.dll [unknown]
1688 C:\Windows\System32\CoreMessaging.dll [unknown]
18220 C:\Windows\System32\KernelBase.dll [unknown]
18220 C:\Windows\System32\conhost.exe [unknown]
12772 C:\Users\aibotics\AppData\Local\Programs\Python\Python39\Lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll [unknown]
Information about 59 modules with CPU utilization below 0,01% has been hidden.
Thread summary
Information about 1202 threads (that have been active at least once) has been captured during the profiling session.
Information about idle threads is not represented here.
Process ID Thread ID Name CPU utilization
12772 13680 undefined 47,56%
4 1068 undefined 18,38%
12772 6084 undefined 4,75%
12772 17356 undefined 4,60%
4 1084 undefined 2,38%
1688 1920 undefined 1,14%
12772 20320 undefined 0,67%
12772 19300 undefined 0,67%
12772 11856 undefined 0,67%
12772 8208 undefined 0,66%
12772 2212 undefined 0,66%
12772 17784 undefined 0,66%
12772 14828 undefined 0,66%
12772 18604 undefined 0,66%
12772 15824 undefined 0,66%
12772 8592 undefined 0,66%
12772 20800 undefined 0,66%
12772 15284 undefined 0,66%
12772 11680 undefined 0,66%
12772 20832 undefined 0,66%
12772 20556 undefined 0,66%
12772 18384 undefined 0,66%
12772 7096 undefined 0,66%
12772 21436 undefined 0,66%
12772 11432 undefined 0,66%
12772 6664 undefined 0,66%
12772 11832 undefined 0,66%
12772 10452 undefined 0,66%
12772 6764 undefined 0,66%
5448 5988 undefined 0,50%
1688 1892 undefined 0,46%
4 1088 undefined 0,40%
4 268 undefined 0,39%
4 232 undefined 0,32%
4 1072 undefined 0,29%
1144 1600 undefined 0,22%
18220 14824 undefined 0,18%
1356 18232 undefined 0,18%
18220 6440 undefined 0,16%
12772 13688 undefined 0,16%
6768 6672 undefined 0,11%
14528 400 undefined 0,11%
1144 1504 undefined 0,11%
1688 968 undefined 0,11%
2580 2616 undefined 0,10%
Information about 280 threads with CPU utilization below 0,10% has been hidden.
There were no samples collected from some threads (perhaps they were idle). Such threads are not displayed.
CPU info
CPU core Socket Core type Max frequency MPIDR
#0 #0 X64 2,85 GHz none
#1 #0 X64 2,85 GHz none
#2 #0 X64 2,85 GHz none
#3 #0 X64 2,85 GHz none
#4 #0 X64 2,85 GHz none
#5 #0 X64 2,85 GHz none
#6 #0 X64 2,85 GHz none
#7 #0 X64 2,85 GHz none
#8 #0 X64 2,85 GHz none
#9 #0 X64 2,85 GHz none
#10 #0 X64 2,85 GHz none
#11 #0 X64 2,85 GHz none
#12 #0 X64 2,85 GHz none
#13 #0 X64 2,85 GHz none
#14 #0 X64 2,85 GHz none
#15 #0 X64 2,85 GHz none
#16 #0 X64 2,85 GHz none
#17 #0 X64 2,85 GHz none
#18 #0 X64 2,85 GHz none
#19 #0 X64 2,85 GHz none
#20 #0 X64 2,85 GHz none
#21 #0 X64 2,85 GHz none
#22 #0 X64 2,85 GHz none
#23 #0 X64 2,85 GHz none
#24 #0 X64 2,85 GHz none
#25 #0 X64 2,85 GHz none
#26 #0 X64 2,85 GHz none
#27 #0 X64 2,85 GHz none
#28 #0 X64 2,85 GHz none
#29 #0 X64 2,85 GHz none
#30 #0 X64 2,85 GHz none
#31 #0 X64 2,85 GHz none
#32 #0 X64 2,85 GHz none
#33 #0 X64 2,85 GHz none
#34 #0 X64 2,85 GHz none
#35 #0 X64 2,85 GHz none
#36 #0 X64 2,85 GHz none
#37 #0 X64 2,85 GHz none
#38 #0 X64 2,85 GHz none
#39 #0 X64 2,85 GHz none
#40 #0 X64 2,85 GHz none
#41 #0 X64 2,85 GHz none
#42 #0 X64 2,85 GHz none
#43 #0 X64 2,85 GHz none
#44 #0 X64 2,85 GHz none
#45 #0 X64 2,85 GHz none
#46 #0 X64 2,85 GHz none
#47 #0 X64 2,85 GHz none
GPU info
1. NVIDIA GeForce RTX 3090
Chip Name GA102
SM Count 82
L2 Cache Size 6,00 MiB
Memory Bandwidth 871,81 GiB/s
Memory Size 24,00 GiB
Core Clock 1,70 GHz
Bus Location 0000:41:00.0
UUID 7e7e1bf6-0fee-d0d1-df8f-27aa769eacb5
2. NVIDIA GeForce RTX 3090
Chip Name GA102
SM Count 82
L2 Cache Size 6,00 MiB
Memory Bandwidth 871,81 GiB/s
Memory Size 24,00 GiB
Core Clock 1,70 GHz
Bus Location 0000:01:00.0
UUID 530ac462-cdba-3ca9-faba-f4ab2e472064
Analysis options
Collect CPU IP samples On
Sampling frequency 1.000 Hz
Collect CPU context switch trace On
Collect backtraces On
Collect OS runtime libraries backtraces Off
Collect NVTX trace Off
Collect CUDA trace Off
Collect OpenGL trace Off
Collect GPU context switch trace Off
Collect DX11 trace Off
Collect DX12 trace Off
Collect Vulkan trace Off
Include child processes On
Collect WDDM trace On
Collect NV Video trace Off
NVIDIA Nsight Systems information
Report captured with 2021.5.2.53-28d0e6e
Report imported with None
Last saved with 2021.5.2.53-28d0e6e
Debug info
Report UUID {c00945da-4f9c-4d92-b826-1e65f3f4897e}
Project UUID {8cee88d8-4300-422e-9142-7e384ff10f72}
Thank you for the update!
Do you see the same behavior when using CUDA_VISIBLE_DEVICES and any other CUDA application (e.g. CUDA samples)?
With torch.cuda.set_device(1) the test script runs in 7.1 sec, and with CUDA_VISIBLE_DEVICES=1 it takes 11.3 sec (devices=["cuda:1"], no training on cuda:0).
I don’t fully understand what devices=["cuda:1"] means in this context, as using CUDA_VISIBLE_DEVICES=1 should fail if cuda:1 is used, since only a single GPU is visible.
Also, have you had a chance to test any other application?
I have not run other scripts with multiple GPUs recently.
It still works without any errors.
When I run this
import torch
CUDA_VISIBLE_DEVICES=0
print(CUDA_VISIBLE_DEVICES)
print(torch.cuda.device_count())
print(torch.cuda.current_device())
a=torch.tensor([1,2,3])
a=a.to("cuda:1")
print(a)
I get this output
0
2
0
tensor([1, 2, 3], device='cuda:1')
Process finished with exit code 0
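A possible reading of this output, offered as a sketch: writing CUDA_VISIBLE_DEVICES=0 inside a Python script only binds a Python name and leaves the process environment unchanged, so both GPUs stay visible. The stdlib-only snippet below illustrates the difference; the function name env_after_python_assignment is hypothetical.

```python
import os


def env_after_python_assignment():
    """Assign a Python name and report the environment before and after."""
    before = os.environ.get("CUDA_VISIBLE_DEVICES")
    CUDA_VISIBLE_DEVICES = 0  # only a local Python name; os.environ is untouched
    after = os.environ.get("CUDA_VISIBLE_DEVICES")
    return before, after


before, after = env_after_python_assignment()
print(before == after)  # the assignment did not change the environment
```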
When I used
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="1"
and
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="0"
it worked as it is supposed to: devices=["cuda:0"] takes 7.2 sec in both cases, and devices=["cuda:1"] raises an error.
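The working configuration can be wrapped in a small helper. This is a sketch, not an established API: the name restrict_visible_gpus is hypothetical, and it assumes the call happens before CUDA is initialized in the process, since the variables are read at initialization. With a single visible GPU, that GPU is addressed as cuda:0 inside the process, which is consistent with cuda:1 raising an error.

```python
import os
import sys


def restrict_visible_gpus(ids, order="PCI_BUS_ID"):
    """Expose only the given physical GPU ids to this process (hypothetical helper)."""
    torch_mod = sys.modules.get("torch")
    if torch_mod is not None and torch_mod.cuda.is_initialized():
        # Too late: the CUDA runtime has already enumerated the devices.
        raise RuntimeError("CUDA is already initialized; set the variables earlier")
    os.environ["CUDA_DEVICE_ORDER"] = order
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in ids)
    return os.environ["CUDA_VISIBLE_DEVICES"]


# Physical GPU 1 becomes "cuda:0" inside this process.
restrict_visible_gpus([1])
```

Calling the helper at the very top of the script, before any tensor is moved to a GPU, keeps the ordering requirement explicit.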