DDP training on RTX 4090 (ADA, cu118)

Thank you for checking. It doesn’t change the behavior. I also disabled IOMMU and SVM in the BIOS.

import os
os.environ["NCCL_P2P_DISABLE"] = "1"
import torch

print('Test 1')
v = torch.randn(5, device='cuda:0')
print(v)
print(v.to('cuda:1'))
print(v.to('cpu').to('cuda:1'))

print('Test 2')
v = torch.randn(5, device='cuda:0')
print(v)
print(v.to('cuda:1'))
print(v.to('cpu').to('cuda:1'))

Test 1
tensor([-0.1360, -1.5022, -1.9172, 0.8753, 0.5528], device='cuda:0')
tensor([0., 0., 0., 0., 0.], device='cuda:1')
tensor([-0.1360, -1.5022, -1.9172, 0.8753, 0.5528], device='cuda:1')
Test 2
tensor([-0.5404, -1.6951, -0.4220, -0.9484, 0.1218], device='cuda:0')
tensor([-0.1360, -1.5022, -1.9172, 0.8753, 0.5528], device='cuda:1')
tensor([-0.5404, -1.6951, -0.4220, -0.9484, 0.1218], device='cuda:1')

It is very likely an NVIDIA driver-related issue. I just finished building a 2x4090 system, and in the initial testing I realized that PyTorch is not working properly with multiple GPUs. Hopefully NVIDIA will fix it soon.