Torch's to() D2D bandwidth is way lower than just calling cudaMemcpyAsync()

Hello Community! 🙂

I’m seeing very different NVLink bandwidth between NVIDIA’s p2pBandwidthLatencyTest and a simple PyTorch Tensor.to() microbenchmark.

import torch

n_test = 50

size = 1 * 1024**3

src_stream = torch.cuda.Stream(0)
dst_stream = torch.cuda.Stream(1)

begin = [torch.cuda.Event(enable_timing=True) for _ in range(n_test)]
end = [torch.cuda.Event(enable_timing=True) for _ in range(n_test)]

print(
    torch.cuda.can_device_access_peer("cuda:0", "cuda:1"),
    torch.cuda.can_device_access_peer("cuda:1", "cuda:0"),
)

for idx in range(n_test):
    with torch.cuda.stream(src_stream), torch.cuda.stream(dst_stream):
        src_tensor = torch.zeros(size, dtype=torch.int8, device="cuda:0")

        begin[idx].record(src_stream)
        dst_tensor = src_tensor.to("cuda:1", non_blocking=True)
        end[idx].record(src_stream)

    src_stream.synchronize()
    dst_stream.synchronize()

for idx in range(n_test):
    print(
        idx,
        (size / 1 * 1024**3) / (begin[idx].elapsed_time(end[idx]) / 1000),
        "GiB/s",
    )

With NVIDIA’s p2pBandwidthLatencyTest: ~260 GiB/s
With PyTorch: ~180 GiB/s

Also, when I profile them with Nsight Systems, there is a visible difference in NVLink bandwidth utilization, as in the screenshots below.

With PyTorch:
[screenshot: NVLink bandwidth utilization in Nsight Systems]

With NVIDIA’s p2pBandwidthLatencyTest:
[screenshot: NVLink bandwidth utilization in Nsight Systems]

I’ve tested transferring tensors of various sizes; the gap doesn’t seem to change.
I’ve also tested this with CUDA 12.1 and 12.4, PyTorch 2.1.0 and 2.4.0, and even with libtorch.

My test setup: Ubuntu 22.04, 2x A100-PCIE-40GB connected with NVLink.

Does anyone know how to increase this NVLink bandwidth, and why it is lower with Torch?

I don’t believe your code is measuring what you expect, since just executing it on my system shows the following (note that (size / 1 * 1024**3) multiplies by 1024**3 instead of dividing by it, which is why the reported numbers land in the 1e20 GiB/s range):

True True
0 1.8611069056223566e+20 GiB/s
1 2.858158028990284e+20 GiB/s
2 2.8896318645194097e+20 GiB/s
3 2.879378253393044e+20 GiB/s
4 2.88518891328474e+20 GiB/s
5 2.885281356930677e+20 GiB/s
6 2.856911514236231e+20 GiB/s
7 2.870820626852414e+20 GiB/s
8 2.860109177266271e+20 GiB/s
9 2.8837802682572302e+20 GiB/s
10 2.8664580582636336e+20 GiB/s
11 2.892717548554107e+20 GiB/s
...

Using a synchronized host timer (a minimal sketch follows the outputs below) shows the expected throughput and comes close to the p2pBandwidthLatencyTest output:

True True
120.0287539103961
172.00936547064114
243.15145801687785
246.97742448140528
245.87622759793612
246.23833937204734
245.2449410922305
245.96833192313278
245.55683371862324
245.98291216146774
245.2250435656027
245.91504775655062
246.70707711492975
246.7607121895865
246.61070882572474
246.08115898634074
245.74239032439328
247.3937729274789
...
./p2pBandwidthLatencyTest
...
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1      2      3      4      5      6      7 
     0 1275.51 262.12 262.87 273.39 274.27 273.18 272.84 273.50 
     1 262.01 1286.01 263.14 273.80 273.30 273.74 273.65 273.77 
     2 261.31 263.34 1282.84 272.24 273.98 273.20 274.26 273.93 
     3 263.15 265.36 263.82 1301.00 274.83 273.43 275.73 275.37 
     4 263.06 264.02 263.27 275.80 1303.17 273.29 274.73 274.76 
     5 263.67 264.33 263.92 274.93 273.94 1301.00 274.82 273.04 
     6 262.78 263.84 263.52 264.28 275.18 275.69 1298.84 273.23 
     7 265.20 264.37 264.31 265.14 274.91 275.44 274.76 1302.08 
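
For reference, the synchronized host-timer measurement could look like this minimal sketch (not the exact code that produced the numbers above): synchronize both devices around the copy and divide the transferred bytes by the wall-clock time.

import time
import torch

n_test = 50
size = 1 * 1024**3  # 1 GiB

src_tensor = torch.zeros(size, dtype=torch.int8, device="cuda:0")

for idx in range(n_test):
    # make sure no previously enqueued work is pending before starting the timer
    torch.cuda.synchronize("cuda:0")
    torch.cuda.synchronize("cuda:1")
    t0 = time.perf_counter()
    dst_tensor = src_tensor.to("cuda:1", non_blocking=True)
    # wait for the copy to finish before stopping the timer
    torch.cuda.synchronize("cuda:0")
    torch.cuda.synchronize("cuda:1")
    t1 = time.perf_counter()
    print(idx, (size / 1024**3) / (t1 - t0), "GiB/s")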

Sorry for posting the wrong code earlier.

It seems the bandwidth issue isn’t related to Torch after all.
However, I’m still unsure about the root cause of the bandwidth problem I’m experiencing on my machine, with and without Torch.

Would you give me some advice on how to identify the cause of this gap?

Well,

After locking the GPU clocks with sudo nvidia-smi -i 0,1 --lock-gpu-clocks=1095, the NVLink bandwidth with Torch works like a charm.
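
One way to check whether idle downclocking explains the gap (just a guess at the mechanism) is to read the SM clock right after each transfer while the clocks are not locked. A minimal sketch, assuming the nvidia-ml-py (pynvml) package is installed:

import time
import torch
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in (0, 1)]

size = 1 * 1024**3  # 1 GiB
src = torch.zeros(size, dtype=torch.int8, device="cuda:0")

for it in range(20):
    torch.cuda.synchronize("cuda:0")
    torch.cuda.synchronize("cuda:1")
    t0 = time.perf_counter()
    dst = src.to("cuda:1", non_blocking=True)
    torch.cuda.synchronize("cuda:0")
    torch.cuda.synchronize("cuda:1")
    t1 = time.perf_counter()
    # SM clocks (MHz) on both GPUs right after the transfer finished
    clocks = [pynvml.nvmlDeviceGetClockInfo(h, pynvml.NVML_CLOCK_SM) for h in handles]
    print(it, f"{(size / 1024**3) / (t1 - t0):.1f} GiB/s", "SM clocks:", clocks)

pynvml.nvmlShutdown()

If the clocks reported in the unlocked runs stay well below the locked value, clock behavior would at least be consistent with the lower bandwidth I saw before locking.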

I still don’t understand what’s happening…
Closing the issue.