When two CUDA devices are available, training on the second one is slower

There are two identical CUDA devices available (RTX 3090). Training on the first device is up to 2x faster than on the second device. Only after calling torch.cuda.set_device(1) does training on the second device become as fast as on the first.

Could you share a minimal and executable code snippet showing this behavior, please?
Also, the output of python -m torch.utils.collect_env would be needed.
If possible, could you also share Nsight Systems profiles?

Here is the Nsight Systems report
https://fileport.io/ETHXKZUhrXTC
Here is the output of python -m torch.utils.collect_env

PyTorch version: 1.12.1+cu116   
Is debug build: False
CUDA used to build PyTorch: 11.6
ROCM used to build PyTorch: N/A 

OS: Microsoft Windows 10 Pro    
GCC version: Could not collect
Clang version: Could not collect
CMake version: Could not collect
Libc version: N/A

Python version: 3.9.13 (tags/v3.9.13:6de2ca5, May 17 2022, 16:36:42) [MSC v.1929 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.19044-SP0

Nvidia driver version: 517.40
cuDNN version: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.6\bin\cudnn_ops_train64_8.dll
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.24.1
[pip3] pytorch-lightning==1.8.3.post1
[pip3] torch==1.12.1+cu116
[pip3] torch-summary==1.4.5
[pip3] torchmetrics==0.11.0
[pip3] torchsummary==1.5.1
[pip3] torchvision==0.13.0+cu116
[conda] Could not collect

Here is the code I was using

import torch
import time
from tqdm import tqdm

devices=["cuda:0","cuda:1"]
torch.random.manual_seed(0)

def network_train(device):
    m = torch.nn.Sequential(
        torch.nn.Linear(100,100),
        torch.nn.Linear(100, 100),
        torch.nn.Linear(100, 100),
        torch.nn.Linear(100, 100),
        torch.nn.Linear(100, 100),
        torch.nn.Linear(100, 100),
        torch.nn.Linear(100, 100),
        torch.nn.Linear(100, 100),
        torch.nn.Linear(100, 100),
        torch.nn.Linear(100, 100),
        torch.nn.Linear(100, 100),
        torch.nn.Linear(100, 100),
        torch.nn.Linear(100, 100),
        torch.nn.Linear(100, 100),
        torch.nn.Linear(100, 100),
        torch.nn.Linear(100, 100),
        torch.nn.Linear(100, 10),
        torch.nn.Linear(10, 1)
    )

    m.to(device)
    x=torch.randn(1000,100)
    y=torch.sum(torch.square(x),dim=0)
    y=torch.unsqueeze(y, 0)
    x=x.to(device)
    y=y.to(device)
    loss_fn = torch.nn.MSELoss(reduction='mean')
    opt=torch.optim.Adam(m.parameters(),lr=0.001)
    t = time.time()
    for _ in tqdm(range(1000)):
        opt.zero_grad()
        outputs = m(x)
        loss = loss_fn(outputs, y)
        loss.backward()
        opt.step()
    print(time.time() - t, device)


for dev in devices:
    network_train(dev)
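
One caveat when reading these timings: CUDA kernels are launched asynchronously, so a wall-clock measurement is more reliable if the device is synchronized before the clock is started and again before it is stopped. Below is a minimal sketch of such a timing helper. The names `timed`, `fake_step`, and `fake_sync` are hypothetical and not part of the original script, and the demonstration uses a counter in place of a real training step so that it runs without a GPU.

```python
import time


def timed(step, iters, sync=None):
    """Return the wall-clock seconds taken by `iters` calls of `step`.

    `sync` is an optional callable that blocks until queued device work is done.
    """
    if sync is not None:
        sync()  # drain earlier work before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        step()
    if sync is not None:
        sync()  # wait for the last queued work before stopping the clock
    return time.perf_counter() - start


# CPU-only demonstration: counters stand in for a training step and a device sync.
calls = {"step": 0, "sync": 0}


def fake_step():
    calls["step"] += 1


def fake_sync():
    calls["sync"] += 1


elapsed = timed(fake_step, iters=1000, sync=fake_sync)
print(calls, elapsed)
```

With PyTorch, `sync` could be something like `lambda: torch.cuda.synchronize(torch.device("cuda:1"))`, and `step` the body of the training loop above. Whether this changes the gap between the two devices is not something the sketch can tell; it only removes one source of measurement noise.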

Thanks for sharing the information!
Unfortunately, I’m not a Windows or Gloo expert (I assume you are using Gloo, as NCCL is not supported on Windows), so I might not know what is causing the issue. Additionally, I cannot open your shared Nsight Systems profile, so I cannot even see the timeline.

The problem occurs only with the Adam optimizer; there is no such problem with SGD. Moreover, a performance profiler showed that the addcdiv_ function in _TensorBase.py was the cause of the problem. By the way, Gloo is used since NCCL is not available.

Here is the analysis summary.

Profiling session duration: 00:44.836
Report file	C:/Users/aibotics/Documents/NVIDIA Nsight Systems/Projects/Project 1/Report 3.nsys-rep
Report size	108.83 MiB
Report capture time	3/7/2023, 3:41:46 PM
Number of events collected	4.863.270
Total number of threads	1.202
Host computer	DESKTOP-NOTA95K
Profiling stop reason	Stopped when last profiled process exited

DESKTOP-NOTA95K (0:0)
Target
Target name	DESKTOP-NOTA95K
Local time at t=0	2023-03-07T15:39:56.927+01:00
UTC time at t=0	2023-03-07T14:39:56.927Z
TSC value at t=0	696180882860583
OS	Windows 10.0.19044
Hardware platform	x86_64
Serial number	Local
CPU description	AMD EPYC 7443P 24-Core Processor
GPU descriptions	NVIDIA GeForce RTX 3090, NVIDIA GeForce RTX 3090, Microsoft Basic Render Driver
NVIDIA driver version	517.40
CPU context switch	not supported
GPU context switch	supported
Tunnel traffic through SSH	no
Timestamp counter	supported
Process summary
Process ID	Name	Arguments	CPU utilization
12772	python.exe		72,39%
4	System		22,73%
1688	dwm.exe		1,82%
5448	Explorer.EXE		1,23%
18220	conhost.exe		0,42%
1144	csrss.exe		0,34%
14528	msedge.exe		0,31%
1356	msedge.exe		0,25%
16132	chrome.exe		0,16%
6768	cmd.exe		0,13%
15288	chrome.exe		0,12%
2580	MemCompression		0,11%
Information about 2 processes with CPU utilization below 0,10% has been hidden.

Module summary
Process ID	Module name	Address	CPU time (overall)
12772	C:\Windows\System32\win32u.dll	[unknown]
1688	C:\Windows\System32\ntdll.dll	[unknown]
12772	C:\Windows\System32\ntdll.dll	[unknown]
12772	C:\Users\aibotics\AppData\Local\Programs\Python\Python39\Lib\site-packages\torch\lib\libiomp5md.dll	[unknown]
12772	C:\Windows\System32\DriverStore\FileRepository\nv_dispsig.inf_amd64_9751f6c72c23c322\nvcuda64.dll	[unknown]
12772	C:\Users\aibotics\AppData\Local\Programs\Python\Python39\Lib\site-packages\torch\lib\torch_cpu.dll	[unknown]
1688	C:\Windows\System32\win32u.dll	[unknown]
1144	C:\Windows\System32\win32u.dll	[unknown]
5448	C:\Windows\System32\win32u.dll	[unknown]
14528	[Unknown]	[unknown]
12772	C:\Users\aibotics\AppData\Local\Programs\Python\Python39\Lib\site-packages\torch\lib\c10.dll	[unknown]
12772	C:\Users\aibotics\AppData\Local\Programs\Python\Python39\python39.dll	[unknown]
1356	[Unknown]	[unknown]
12772	C:\Users\aibotics\AppData\Local\Programs\Python\Python39\Lib\site-packages\torch\lib\cublasLt64_11.dll	[unknown]
12772	C:\Windows\System32\KernelBase.dll	[unknown]
12772	C:\Users\aibotics\AppData\Local\Programs\Python\Python39\Lib\site-packages\torch\lib\torch_cuda_cu.dll	[unknown]
5448	C:\Windows\System32\ntdll.dll	[unknown]
18220	C:\Windows\System32\win32u.dll	[unknown]
18220	C:\Windows\System32\ntdll.dll	[unknown]
16132	[Unknown]	[unknown]
12772	C:\Users\aibotics\AppData\Local\Programs\Python\Python39\Lib\site-packages\torch\lib\c10_cuda.dll	[unknown]
12772	C:\Users\aibotics\AppData\Local\Programs\Python\Python39\Lib\site-packages\torch\lib\torch_python.dll	[unknown]
12772	C:\Windows\System32\msvcp140.dll	[unknown]
15288	[Unknown]	[unknown]
12772	C:\Windows\System32\ucrtbase.dll	[unknown]
6768	C:\Windows\System32\ntdll.dll	[unknown]
1688	C:\Windows\System32\dwmcore.dll	[unknown]
12772	C:\Users\aibotics\AppData\Local\Programs\Python\Python39\vcruntime140.dll	[unknown]
12772	C:\Users\aibotics\AppData\Local\Programs\Python\Python39\Lib\site-packages\torch\lib\cublas64_11.dll	[unknown]
1688	C:\Windows\System32\KernelBase.dll	[unknown]
5448	C:\Windows\System32\KernelBase.dll	[unknown]
12772	C:\Windows\System32\kernel32.dll	[unknown]
1688	C:\Windows\System32\DriverStore\FileRepository\nv_dispsig.inf_amd64_9751f6c72c23c322\nvwgf2umx_cfg.dll	[unknown]
1144	C:\Windows\System32\ntdll.dll	[unknown]
12772	C:\Users\aibotics\AppData\Local\Programs\Python\Python39\Lib\site-packages\torch\lib\torch_cuda_cpp.dll	[unknown]
1688	C:\Windows\System32\CoreMessaging.dll	[unknown]
18220	C:\Windows\System32\KernelBase.dll	[unknown]
18220	C:\Windows\System32\conhost.exe	[unknown]
12772	C:\Users\aibotics\AppData\Local\Programs\Python\Python39\Lib\site-packages\torch\lib\cudnn_cnn_infer64_8.dll	[unknown]
Information about 59 modules with CPU utilization below 0,01% has been hidden.

Thread summary
Information about 1202 threads (that have been active at least once) has been captured during the profiling session.
Information about idle threads is not represented here.

Process ID	Thread ID	Name	CPU utilization
12772	13680	undefined	47,56%
4	1068	undefined	18,38%
12772	6084	undefined	4,75%
12772	17356	undefined	4,60%
4	1084	undefined	2,38%
1688	1920	undefined	1,14%
12772	20320	undefined	0,67%
12772	19300	undefined	0,67%
12772	11856	undefined	0,67%
12772	8208	undefined	0,66%
12772	2212	undefined	0,66%
12772	17784	undefined	0,66%
12772	14828	undefined	0,66%
12772	18604	undefined	0,66%
12772	15824	undefined	0,66%
12772	8592	undefined	0,66%
12772	20800	undefined	0,66%
12772	15284	undefined	0,66%
12772	11680	undefined	0,66%
12772	20832	undefined	0,66%
12772	20556	undefined	0,66%
12772	18384	undefined	0,66%
12772	7096	undefined	0,66%
12772	21436	undefined	0,66%
12772	11432	undefined	0,66%
12772	6664	undefined	0,66%
12772	11832	undefined	0,66%
12772	10452	undefined	0,66%
12772	6764	undefined	0,66%
5448	5988	undefined	0,50%
1688	1892	undefined	0,46%
4	1088	undefined	0,40%
4	268	undefined	0,39%
4	232	undefined	0,32%
4	1072	undefined	0,29%
1144	1600	undefined	0,22%
18220	14824	undefined	0,18%
1356	18232	undefined	0,18%
18220	6440	undefined	0,16%
12772	13688	undefined	0,16%
6768	6672	undefined	0,11%
14528	400	undefined	0,11%
1144	1504	undefined	0,11%
1688	968	undefined	0,11%
2580	2616	undefined	0,10%
Information about 280 threads with CPU utilization below 0,10% has been hidden.

 There were no samples collected from some threads (perhaps they were idle). Such threads are not displayed.

CPU info
CPU core	Socket	Core type	Max frequency	MPIDR
#0	#0	X64	2,85 GHz	none
#1	#0	X64	2,85 GHz	none
#2	#0	X64	2,85 GHz	none
#3	#0	X64	2,85 GHz	none
#4	#0	X64	2,85 GHz	none
#5	#0	X64	2,85 GHz	none
#6	#0	X64	2,85 GHz	none
#7	#0	X64	2,85 GHz	none
#8	#0	X64	2,85 GHz	none
#9	#0	X64	2,85 GHz	none
#10	#0	X64	2,85 GHz	none
#11	#0	X64	2,85 GHz	none
#12	#0	X64	2,85 GHz	none
#13	#0	X64	2,85 GHz	none
#14	#0	X64	2,85 GHz	none
#15	#0	X64	2,85 GHz	none
#16	#0	X64	2,85 GHz	none
#17	#0	X64	2,85 GHz	none
#18	#0	X64	2,85 GHz	none
#19	#0	X64	2,85 GHz	none
#20	#0	X64	2,85 GHz	none
#21	#0	X64	2,85 GHz	none
#22	#0	X64	2,85 GHz	none
#23	#0	X64	2,85 GHz	none
#24	#0	X64	2,85 GHz	none
#25	#0	X64	2,85 GHz	none
#26	#0	X64	2,85 GHz	none
#27	#0	X64	2,85 GHz	none
#28	#0	X64	2,85 GHz	none
#29	#0	X64	2,85 GHz	none
#30	#0	X64	2,85 GHz	none
#31	#0	X64	2,85 GHz	none
#32	#0	X64	2,85 GHz	none
#33	#0	X64	2,85 GHz	none
#34	#0	X64	2,85 GHz	none
#35	#0	X64	2,85 GHz	none
#36	#0	X64	2,85 GHz	none
#37	#0	X64	2,85 GHz	none
#38	#0	X64	2,85 GHz	none
#39	#0	X64	2,85 GHz	none
#40	#0	X64	2,85 GHz	none
#41	#0	X64	2,85 GHz	none
#42	#0	X64	2,85 GHz	none
#43	#0	X64	2,85 GHz	none
#44	#0	X64	2,85 GHz	none
#45	#0	X64	2,85 GHz	none
#46	#0	X64	2,85 GHz	none
#47	#0	X64	2,85 GHz	none
GPU info
1. NVIDIA GeForce RTX 3090
Chip Name	GA102
SM Count	82
L2 Cache Size	6,00 MiB
Memory Bandwidth	871,81 GiB/s
Memory Size	24,00 GiB
Core Clock	1,70 GHz
Bus Location	0000:41:00.0
UUID	7e7e1bf6-0fee-d0d1-df8f-27aa769eacb5
2. NVIDIA GeForce RTX 3090
Chip Name	GA102
SM Count	82
L2 Cache Size	6,00 MiB
Memory Bandwidth	871,81 GiB/s
Memory Size	24,00 GiB
Core Clock	1,70 GHz
Bus Location	0000:01:00.0
UUID	530ac462-cdba-3ca9-faba-f4ab2e472064

Analysis options
Collect CPU IP samples	On
Sampling frequency	1.000 Hz
Collect CPU context switch trace	On
Collect backtraces	On
Collect OS runtime libraries backtraces	Off
Collect NVTX trace	Off
Collect CUDA trace	Off
Collect OpenGL trace	Off
Collect GPU context switch trace	Off
Collect DX11 trace	Off
Collect DX12 trace	Off
Collect Vulkan trace	Off
Include child processes	On
Collect WDDM trace	On
Collect NV Video trace	Off
NVIDIA Nsight Systems information
Report captured with	2021.5.2.53-28d0e6e
Report imported with	None
Last saved with	2021.5.2.53-28d0e6e
Debug info
Report UUID	{c00945da-4f9c-4d92-b826-1e65f3f4897e}
Project UUID	{8cee88d8-4300-422e-9142-7e384ff10f72}

Thank you for the update!
Do you see the same behavior when using CUDA_VISIBLE_DEVICES and any other CUDA application (e.g. CUDA samples)?

With torch.cuda.set_device(1) the test script runs in 7.1 sec, and with CUDA_VISIBLE_DEVICES=1 it takes 11.3 sec (devices=["cuda:1"], no training on cuda:0).

I don’t fully understand what devices=["cuda:1"] means in this context, as using CUDA_VISIBLE_DEVICES=1 should fail if cuda:1 is used, since only a single GPU is visible.
Also, have you had a chance to test any other application?

I have not run other scripts with multiple GPUs recently.
It still works without any errors.
When I run this

import torch


CUDA_VISIBLE_DEVICES=0
print(CUDA_VISIBLE_DEVICES)
print(torch.cuda.device_count())
print(torch.cuda.current_device())
a=torch.tensor([1,2,3])
a=a.to("cuda:1")
print(a)

I get this output

0
2
0
tensor([1, 2, 3], device='cuda:1')

Process finished with exit code 0
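
A note on the snippet above: a line such as CUDA_VISIBLE_DEVICES=0 inside a Python script only binds a Python variable. It does not touch the process environment, which would explain why torch.cuda.device_count() still reports 2 and cuda:1 remains usable. The environment variable has to be set as a string, in os.environ or in the shell, before CUDA is initialized in the process. A minimal sketch of the difference, using only the standard library (the names `seen_after_assignment` and `seen_after_environ` are just for illustration):

```python
import os

# Start from a clean state so the demonstration is reproducible.
os.environ.pop("CUDA_VISIBLE_DEVICES", None)

# Same pattern as in the snippet above: this binds a Python name only.
CUDA_VISIBLE_DEVICES = 0
seen_after_assignment = os.environ.get("CUDA_VISIBLE_DEVICES")

# This is what actually changes the process environment; values must be strings.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"
seen_after_environ = os.environ.get("CUDA_VISIBLE_DEVICES")

print(seen_after_assignment, seen_after_environ)  # None 1
```

In a real script the os.environ assignment should come before the first CUDA call, most safely before `import torch`.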

When I used

import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"  
os.environ["CUDA_VISIBLE_DEVICES"]="1"

and

import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"  
os.environ["CUDA_VISIBLE_DEVICES"]="0"

it worked as it is supposed to: now devices=["cuda:0"] takes 7.2 sec in both cases, and there is an error for devices=["cuda:1"].
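
The reason cuda:1 now fails while cuda:0 works in both cases is that CUDA renumbers the visible devices from zero: with CUDA_VISIBLE_DEVICES="1", physical GPU 1 becomes cuda:0 and no cuda:1 exists. The helper below is a simplified sketch of that renumbering, not a PyTorch or CUDA API. The name `logical_to_physical` is hypothetical, and it handles only comma-separated integer indices (no GPU UUIDs).

```python
def logical_to_physical(visible_devices, physical_count):
    """Map logical CUDA indices to physical ones for a CUDA_VISIBLE_DEVICES value.

    Simplified illustration: integer indices only, no UUID support.
    """
    if visible_devices is None:
        # Variable unset: every physical device is visible under its own index.
        return {i: i for i in range(physical_count)}
    mapping = {}
    for token in visible_devices.split(","):
        token = token.strip()
        # CUDA stops at the first invalid entry; later entries are ignored.
        if not token.isdigit() or int(token) >= physical_count:
            break
        mapping[len(mapping)] = int(token)
    return mapping


print(logical_to_physical("1", 2))   # {0: 1}
print(logical_to_physical("0", 2))   # {0: 0}
print(logical_to_physical(None, 2))  # {0: 0, 1: 1}
```

Setting CUDA_DEVICE_ORDER="PCI_BUS_ID", as done above, makes the physical indices follow PCI bus order rather than the default fastest-first order.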