I’m seeing very different NVLink bandwidth between NVIDIA’s p2pBandwidthLatencyTest and a simple PyTorch Tensor.to() microbenchmark.
import torch

n_test = 50
size = 1 * 1024**3  # 1 GiB of int8 elements

src_stream = torch.cuda.Stream(0)
dst_stream = torch.cuda.Stream(1)
begin = [torch.cuda.Event(enable_timing=True) for _ in range(n_test)]
end = [torch.cuda.Event(enable_timing=True) for _ in range(n_test)]

# Confirm peer-to-peer access is available in both directions.
print(
    torch.cuda.can_device_access_peer("cuda:0", "cuda:1"),
    torch.cuda.can_device_access_peer("cuda:1", "cuda:0"),
)

for idx in range(n_test):
    # Make both per-device streams current and time the copy on the source stream.
    with torch.cuda.stream(src_stream), torch.cuda.stream(dst_stream):
        src_tensor = torch.zeros(size, dtype=torch.int8, device="cuda:0")
        begin[idx].record(src_stream)
        dst_tensor = src_tensor.to("cuda:1", non_blocking=True)
        end[idx].record(src_stream)
    src_stream.synchronize()
    dst_stream.synchronize()

for idx in range(n_test):
    # elapsed_time() returns milliseconds; convert bytes -> GiB and ms -> s.
    print(
        idx,
        (size / 1024**3) / (begin[idx].elapsed_time(end[idx]) / 1000),
        "GiB/s",
    )
With NVIDIA’s p2pBandwidthLatencyTest: ~260 GiB/s
With PyTorch: ~180 GiB/s
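One difference I can think of: Tensor.to() allocates the destination tensor inside the timed region, while p2pBandwidthLatencyTest pre-allocates its buffers and times only the copies. Below is a rough variant that pre-allocates the destination and times copy_() alone; this is just a sketch (the stream and variable names are mine), and I haven’t confirmed it closes the gap:

import torch

n_test = 50
size = 1 * 1024**3  # 1 GiB, same as above

# Pre-allocate both buffers so allocation stays outside the timed region,
# closer to what p2pBandwidthLatencyTest does.
src_tensor = torch.zeros(size, dtype=torch.int8, device="cuda:0")
dst_tensor = torch.empty(size, dtype=torch.int8, device="cuda:1")

begin = [torch.cuda.Event(enable_timing=True) for _ in range(n_test)]
end = [torch.cuda.Event(enable_timing=True) for _ in range(n_test)]
stream = torch.cuda.Stream(0)  # stream on the source device

for idx in range(n_test):
    with torch.cuda.stream(stream):
        begin[idx].record(stream)
        dst_tensor.copy_(src_tensor, non_blocking=True)
        end[idx].record(stream)
    stream.synchronize()

for idx in range(n_test):
    print(idx, (size / 1024**3) / (begin[idx].elapsed_time(end[idx]) / 1000), "GiB/s")

If this still reports ~180 GiB/s, the allocation path can probably be ruled out.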
Also, when I profile both runs with Nsight Systems, it shows a difference in NVLink bandwidth utilization, as in the screenshots below.
[screenshot: NVLink utilization with PyTorch]
[screenshot: NVLink utilization with NVIDIA’s p2pBandwidthLatencyTest]
I’ve tested transferring tensors of various sizes; the gap doesn’t seem to change.
I’ve tested this with CUDA 12.1 and 12.4, PyTorch 2.1.0 and 2.4.0, and even with libtorch.
My test setup: Ubuntu 22.04, two A100-PCIe-40GB GPUs connected with an NVLink bridge.
Does anyone know how to increase this NVLink bandwidth, and why it comes out lower with PyTorch?
Update: it seems the bandwidth issue isn’t related to Torch after all.
However, I’m still unsure about the root cause of the bandwidth problem I’m experiencing on my machine, with and without Torch.
Would you give me some advice on how to identify the cause of this gap?
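For reference, one cross-check I’m considering is to time a batch of back-to-back copies with wall-clock time around full device synchronizations, which I believe is closer to how p2pBandwidthLatencyTest averages over repeated copies. This is only a sketch, and the repeat count is arbitrary:

import time
import torch

n_repeat = 100          # arbitrary repeat count
size = 1 * 1024**3      # 1 GiB

src = torch.zeros(size, dtype=torch.int8, device="cuda:0")
dst = torch.empty(size, dtype=torch.int8, device="cuda:1")

# Warm up once so lazy initialization and P2P enablement are out of the timing.
dst.copy_(src, non_blocking=True)
torch.cuda.synchronize(0)
torch.cuda.synchronize(1)

# Time a batch of back-to-back copies with wall-clock time around full syncs.
t0 = time.perf_counter()
for _ in range(n_repeat):
    dst.copy_(src, non_blocking=True)
torch.cuda.synchronize(0)
torch.cuda.synchronize(1)
t1 = time.perf_counter()

total_gib = n_repeat * size / 1024**3
print(total_gib / (t1 - t0), "GiB/s")

If this batched number still lands around 180 GiB/s, the per-copy CUDA-event timing probably isn’t the issue either.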