Thank you for checking. It doesn't change the behavior. I also disabled IOMMU and SVM in the BIOS.
import os
os.environ["NCCL_P2P_DISABLE"] = "1"  # set before importing torch
import torch

print('Test 1')
v = torch.randn(5, device='cuda:0')
print(v)
print(v.to('cuda:1'))            # direct GPU-to-GPU copy
print(v.to('cpu').to('cuda:1'))  # copy staged through the CPU

print('Test 2')
v = torch.randn(5, device='cuda:0')
print(v)
print(v.to('cuda:1'))            # direct GPU-to-GPU copy
print(v.to('cpu').to('cuda:1'))  # copy staged through the CPU
Test 1
tensor([-0.1360, -1.5022, -1.9172, 0.8753, 0.5528], device='cuda:0')
tensor([0., 0., 0., 0., 0.], device='cuda:1')
tensor([-0.1360, -1.5022, -1.9172, 0.8753, 0.5528], device='cuda:1')
Test 2
tensor([-0.5404, -1.6951, -0.4220, -0.9484, 0.1218], device='cuda:0')
tensor([-0.1360, -1.5022, -1.9172, 0.8753, 0.5528], device='cuda:1')
tensor([-0.5404, -1.6951, -0.4220, -0.9484, 0.1218], device='cuda:1')
It is very likely an NVIDIA driver-related issue. I just finished building a 2x4090 system, and in my initial testing I realized that PyTorch was not working properly across multiple GPUs: the direct GPU-to-GPU copy silently returns wrong data (zeros in Test 1, and the stale values from Test 1 in Test 2), while copying through the CPU works. Hopefully it will be fixed by NVIDIA soon.
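
In case it helps anyone else reproduce or diagnose this, here is a minimal diagnostic sketch (my own test, assuming two visible GPUs at indices 0 and 1) that asks the driver whether it reports peer access between the devices and checks whether a direct device-to-device copy carries the same values as one staged through the CPU:

import torch

# Assumes at least two visible CUDA devices (indices 0 and 1).
assert torch.cuda.device_count() >= 2

# Does the driver claim peer (P2P) access is possible in each direction?
print("peer access 0 -> 1:", torch.cuda.can_device_access_peer(0, 1))
print("peer access 1 -> 0:", torch.cuda.can_device_access_peer(1, 0))

src = torch.arange(5, dtype=torch.float32, device="cuda:0")
direct = src.to("cuda:1")             # direct GPU-to-GPU copy
via_cpu = src.to("cpu").to("cuda:1")  # copy staged through host memory

print("direct copy correct: ", torch.equal(direct.cpu(), src.cpu()))
print("via-CPU copy correct:", torch.equal(via_cpu.cpu(), src.cpu()))

On my machine the direct copy is the one that comes back wrong, which is why I suspect the driver rather than PyTorch itself.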