Direct P2P GPU <-> GPU communication with torch.to does not seem to work

Hi,

I’ve been looking at direct GPU ↔ GPU communication using PyTorch’s Tensor.to method, and I’ve found that it doesn’t copy a tensor correctly from one CUDA device to the other: the copy silently produces wrong data.

I’m sorry if I’ve missed something obvious, but I didn’t see anything documented saying this shouldn’t work as expected.

import torch
import importlib.metadata


print(f"torch version: {importlib.metadata.version('torch')}\n")

print(f"cuda is_available: {torch.cuda.is_available()}")
d_count = torch.cuda.device_count()
print(f"device_count(): {d_count}")
for idx in range(d_count):
    print(f"get_device_name({idx}): {torch.cuda.get_device_name(idx)}")
    print(f"get_device_properties({idx}): {torch.cuda.get_device_properties(idx)}")
    print(f"get_device_capability({idx}): {torch.cuda.get_device_capability(idx)}")
print(f"current device: {torch.cuda.current_device()}\n")

if d_count > 1:
    access_mat = torch.zeros((d_count, d_count), dtype=torch.bool)
    for i in range(d_count):
        for j in range(d_count):
            access_mat[i, j] = (
                torch.cuda.can_device_access_peer(i, j) if i != j else True
            )

    print("Devices access matrix:\n", access_mat.data, "\n")


def get_tensor_info(name, t):
    return f"{name:10s} -> device:{t.device}, dtype:{t.dtype}, shape:{t.shape}, mean:{t.mean()}"


class DistributedMatMul(torch.nn.Module):
    def __init__(self, D):
        super().__init__()

        self.device0 = torch.device("cuda", 0)
        self.device1 = torch.device("cuda", 1)

        self.w0 = torch.ones((D, 2 * D), dtype=torch.float32, device=self.device0)
        self.w1 = torch.ones((2 * D, D), dtype=torch.float32, device=self.device1)

    def forward(self, x):
        x_gpu_0 = x.to(self.device0)

        y0 = x_gpu_0 @ self.w0
        print(f"{'y0':10s} -> {y0}")

        # y0_gpu_1 = y0.to("cpu").to(self.device1) # This work

        y0_gpu_1 = y0.to(self.device1)  # This does not work
        print(
            f"{'y0_gpu_1':10s} -> {y0_gpu_1}"
        )  # should return [[2., 2., 2., 2.]] but returns [[1., 0., 0., 0.]]

        y1 = y0_gpu_1 @ self.w1

        y_cpu = y1.cpu().mean()

        return y_cpu

    def __str__(self):
        w0 = get_tensor_info("w0", self.w0)
        w1 = get_tensor_info("w1", self.w1)
        return f"{w0}\n{w1}"


torch.manual_seed(0)

N = 1
D = 2

model = DistributedMatMul(D)
print(model)
x_cpu = torch.ones((N, D), dtype=torch.float32, device="cpu")
y_cpu = model(x_cpu)

# Returns a wrong value when the copy fails (1.0 in the run below); should return 8.0
print(get_tensor_info("y_cpu", y_cpu))

######
# Output for the above code
######
# torch version: 2.1.2.post301

# cuda is_available: True
# device_count(): 2
# get_device_name(0): NVIDIA Graphics Device
# get_device_properties(0): _CudaDeviceProperties(name='NVIDIA Graphics Device', major=8, minor=9, total_memory=15868MB, multi_processor_count=66)
# get_device_capability(0): (8, 9)
# get_device_name(1): NVIDIA Graphics Device
# get_device_properties(1): _CudaDeviceProperties(name='NVIDIA Graphics Device', major=8, minor=9, total_memory=15868MB, multi_processor_count=66)
# get_device_capability(1): (8, 9)
# current device: 0

# Devices access matrix:
#  tensor([[True, True],
#         [True, True]])

# w0         -> device:cuda:0, dtype:torch.float32, shape:torch.Size([2, 4]), mean:1.0
# w1         -> device:cuda:1, dtype:torch.float32, shape:torch.Size([4, 2]), mean:1.0
# y0         -> tensor([[2., 2., 2., 2.]], device='cuda:0')
# y0_gpu_1   -> tensor([[1., 0., 0., 0.]], device='cuda:1') <------------ !!!! The copy is completely wrong
# y_cpu      -> device:cpu, dtype:torch.float32, shape:torch.Size([]), mean:1.0
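
For a more compact check, here is a minimal sketch (assuming two visible CUDA devices) that compares the direct cuda:0 → cuda:1 copy against the same copy staged through host memory:

import torch

src = torch.arange(16, dtype=torch.float32, device="cuda:0")
direct = src.to("cuda:1")            # direct device-to-device copy
staged = src.to("cpu").to("cuda:1")  # reference copy staged through the host

# On a healthy setup both tensors are identical; on this machine the direct
# copy comes back corrupted, so this prints False.
print(torch.equal(direct.cpu(), staged.cpu()))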

I also ran the NVIDIA CUDA samples p2pBandwidthLatencyTest:

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA Graphics Device, pciBusID: 1, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA Graphics Device, pciBusID: 3, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0	     1     1
     1	     1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 608.45  12.14 
     1  12.09 613.95 
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1 
     0 608.92  13.55 
     1  13.55 614.19 
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 611.34  17.34 
     1  17.24 613.59 
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 611.43  27.10 
     1  27.10 613.69 
P2P=Disabled Latency Matrix (us)
   GPU     0      1 
     0   1.26  10.37 
     1  10.28   1.22 

   CPU     0      1 
     0   1.34   4.23 
     1   4.24   1.32 
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1 
     0   1.27   0.91 
     1   0.90   1.22 

   CPU     0      1 
     0   1.40   1.09 
     1   1.11   1.33 

And got the topology from nvidia-smi topo -m:

	GPU0	GPU1	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	PHB	0-23	0		N/A
GPU1	PHB	 X 	0-23	0		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge

Run the NCCL tests to check for numerical mismatches, and check whether e.g. the IOMMU is enabled and is causing issues.
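
For a quick correctness check from PyTorch itself, here is a minimal sketch (not the official nccl-tests binaries; it assumes 2 GPUs, the NCCL backend, and that port 29500 is free) that all_reduces a known tensor and verifies the result on each rank:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def run(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Each rank contributes rank + 1, so the summed result is known in advance.
    t = torch.full((1024,), float(rank + 1), device=f"cuda:{rank}")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)

    expected = float(sum(range(1, world_size + 1)))  # 3.0 with 2 GPUs
    ok = torch.allclose(t, torch.full_like(t, expected))
    print(f"rank {rank}: all_reduce correct = {ok}")
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2)

Running it once normally and once with NCCL_P2P_DISABLE=1 should show whether the P2P path is the one producing wrong results.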

Thanks for taking the time to answer.

I’ve disabled the IOMMU at the BIOS level and also disabled ACS for all possible devices in my machine. This does not change anything.
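
To double-check from the OS side that both settings actually took effect, a rough sketch like the following can help (assumptions: a Linux host, and that lspci -vvv runs with enough privileges to show the ACS capability):

import glob
import subprocess

# With the IOMMU disabled, /sys/class/iommu is typically empty.
print("IOMMU entries:", glob.glob("/sys/class/iommu/*"))

# Any remaining "ACSCtl: SrcValid+ ..." lines indicate ACS is still active
# on some PCIe bridge.
out = subprocess.run(["lspci", "-vvv"], capture_output=True, text=True).stdout
print([line.strip() for line in out.splitlines() if "ACSCtl" in line])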

Summary so far:

Running device_to_device_memcpy_read_ce.
 Invalid value when checking the pattern at <0x7fefec000000>
 Current offset [ 0/67108864]

So the data is not copied properly, but driver 545.* still reports that P2P is available.

Do you know how to determine, in theory, whether my hardware should have P2P capabilities or not?

So I ended up confirming that the RTX 40* series does not support P2P at all, and that driver 545.* wrongly reports that it does.

I wrote a quick post-mortem here for future troubleshooters: Multi-gpu (Nvidia) P2P capabilities and debugging tips | by Morgan | Feb, 2024 | Medium