Tensor values change when transferred between GPUs

While using PyTorch 2.1, I ran into the following strange behavior. Is this a bug?

Python 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:58:50) 
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.device_count()
8
>>> torch.cuda.is_available()
True
>>> torch.__version__
'2.1.0+cu121'
>>> temp = torch.tensor([[33, 34, 34, 34, 34, 35, 35, 35, 35, 35, 35, 35, 35, 36, 36, 36, 36, 36, 36, 36, 36, 37, 37, 37, 38, 38]], device='cuda:0')
>>> temp
tensor([[33, 34, 34, 34, 34, 35, 35, 35, 35, 35, 35, 35, 35, 36, 36, 36, 36, 36,
         36, 36, 36, 37, 37, 37, 38, 38]], device='cuda:0')
>>> temp.to('cuda:1')
tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0]], device='cuda:1')
>>> temp.to('cpu' )
tensor([[33, 34, 34, 34, 34, 35, 35, 35, 35, 35, 35, 35, 35, 36, 36, 36, 36, 36,
         36, 36, 36, 37, 37, 37, 38, 38]])
>>> temp.to('cuda:1' )
tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0]], device='cuda:1')
>>> temp.to('cuda:2' )
tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0]], device='cuda:2')
>>> temp.to('cuda:3' )
tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0]], device='cuda:3')
>>> temp.to('cuda:4' )
tensor([[33, 34, 34, 34, 34, 35, 35, 35, 35, 35, 35, 35, 35, 36, 36, 36, 36, 36,
         36, 36, 36, 37, 37, 37, 38, 38]], device='cuda:4')
>>> temp.to('cuda:5' )
tensor([[33, 34, 34, 34, 34, 35, 35, 35, 35, 35, 35, 35, 35, 36, 36, 36, 36, 36,
         36, 36, 36, 37, 37, 37, 38, 38]], device='cuda:5')
>>> temp.to('cuda:6' )
tensor([[33, 34, 34, 34, 34, 35, 35, 35, 35, 35, 35, 35, 35, 36, 36, 36, 36, 36,
         36, 36, 36, 37, 37, 37, 38, 38]], device='cuda:6')
>>> temp.to('cuda:7' )
tensor([[33, 34, 34, 34, 34, 35, 35, 35, 35, 35, 35, 35, 35, 36, 36, 36, 36, 36,
         36, 36, 36, 37, 37, 37, 38, 38]], device='cuda:7')
>>>

I tested this under different versions of PyTorch (e.g. 2.0.1+cu118) and on different servers (A6000 and A6000 Ada), and the problem was always reproducible. The test script is as follows:

import torch

gpus = ['cuda:' + str(i) for i in range(8)]

for i in range(8):
    cur_device = gpus[i]
    temp = torch.tensor([[33, 34, 34, 34, 34, 35, 35, 35, 35, 35, 35, 35, 35,
                          36, 36, 36, 36, 36, 36, 36, 36, 37, 37, 37, 38, 38]],
                        device=cur_device)
    print('>>>>>>>      ', temp, '      <<<<<<<')

    for j in range(8):
        if i == j:
            continue
        to_device = gpus[j]
        # move back to the CPU first, then onto the target GPU
        _temp = temp.cpu().to(device=to_device)
        print(cur_device, ' -> ', to_device)
        print(_temp)
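For reference, the same loop can be made self-checking by comparing every transferred copy against a CPU reference with `torch.equal`, so corrupted pairs are collected instead of eyeballed (a sketch; the helper name and the use of a direct GPU-to-GPU `.to()` are my own choices, not from the original script):

```python
import torch

REFERENCE = torch.tensor([[33, 34, 34, 34, 34, 35, 35, 35, 35, 35, 35, 35, 35,
                           36, 36, 36, 36, 36, 36, 36, 36, 37, 37, 37, 38, 38]])

def find_corrupted_pairs(num_gpus):
    """Return (src, dst) index pairs whose direct GPU-to-GPU copy
    no longer matches the CPU reference values."""
    bad = []
    for i in range(num_gpus):
        src = REFERENCE.to(f'cuda:{i}')
        for j in range(num_gpus):
            if i == j:
                continue
            moved = src.to(f'cuda:{j}')              # direct device-to-device copy
            if not torch.equal(moved.cpu(), REFERENCE):  # compare on the CPU
                bad.append((i, j))
    return bad

if torch.cuda.is_available():
    print(find_corrupted_pairs(torch.cuda.device_count()))
```

On a healthy machine this should print an empty list; on the setup described above it should list the pairs that return zeros.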

Make sure your multi-GPU setup is working properly, e.g. by running nccl-tests. If these tests also show data corruption, check the NCCL FAQ; for example, you might want to disable the IOMMU if it’s active.
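As a quick first check before nccl-tests, PyTorch can also report whether the driver exposes peer-to-peer access between each GPU pair via `torch.cuda.can_device_access_peer` (a sketch; the helper name is mine — a broken P2P path, e.g. due to IOMMU/ACS settings, is a common cause of this kind of silent corruption):

```python
import torch

def peer_access_matrix(num_gpus):
    """Map each ordered GPU pair (i, j) to whether direct peer access is reported."""
    return {
        (i, j): torch.cuda.can_device_access_peer(i, j)
        for i in range(num_gpus)
        for j in range(num_gpus)
        if i != j
    }

if torch.cuda.is_available():
    for (i, j), ok in peer_access_matrix(torch.cuda.device_count()).items():
        print(f'cuda:{i} -> cuda:{j}: peer access {"yes" if ok else "NO"}')
```

Note that a pair reporting no peer access is not corruption by itself, but an asymmetric or unexpected pattern here is worth comparing against the pairs that return zeros.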


Can you share how you ran the script? Did you launch it with plain python or with torchrun?

nccl-tests builds standalone binaries, and the repository documents their usage. PyTorch is not used in these tests and is irrelevant to them.

Thanks! I will check it.

Yes, I just ran it with plain python, e.g. python main.py.

I tested on a machine with 4 H100s and cannot reproduce it. It also doesn't look distributed-related; can you file a GitHub issue in the PyTorch repo? I guess this might be CUDA-version-specific.