Moving a Tensor to another GPU modifies the data

The following code snippet reproduces the issue:

>>> import torch
>>> print(torch.__version__)
1.9.0+cu102
>>> test = torch.tensor([0.5]).to('cuda:0')
>>> test
tensor([0.5000], device='cuda:0')
>>> test.to('cuda:1')
tensor([0.], device='cuda:1')

Oddly enough, this only happens on one of my machines.
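
A quick way to narrow this down (a sketch, not from the original thread) is to stage the copy through host memory, which avoids the direct peer-to-peer path, and compare the two results:

import torch

src = torch.tensor([0.5], device='cuda:0')

# Direct device-to-device copy -- the transfer that shows the corruption.
direct = src.to('cuda:1')

# Copy staged through host memory, bypassing the peer-to-peer path.
staged = src.cpu().to('cuda:1')

print(direct, staged)  # on a healthy setup both should be tensor([0.5000], device='cuda:1')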

This could indicate an issue in your node setup, i.e. broken communication between the devices.
Could you run the nccl-tests and see whether any issues are detected while communicating between the devices?
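
If building the standalone nccl-tests is inconvenient, a minimal torch.distributed sketch (assuming both GPUs are visible; the localhost address and port below are placeholders) also exercises NCCL between the two devices:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # Minimal single-node rendezvous; address/port are placeholders.
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '29500'
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    t = torch.ones(1, device='cuda')
    dist.all_reduce(t)  # expect tensor([2.]) on both ranks if NCCL communication works
    print(f'rank {rank}: {t}')
    dist.destroy_process_group()

if __name__ == '__main__':
    mp.spawn(worker, args=(2,), nprocs=2)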

Is NCCL required? I didn’t have it installed previously. I’m not using both of the GPUs for a single task. I’m simply trying to transfer data from ‘cuda:0’ to ‘cuda:1’ offline.

It is interesting to note that when running the nccl-tests, I am seeing a lot of lines with “IO_PAGE_FAULT”.

Yes, NCCL would be needed to run these tests.
The IO_PAGE_FAULT errors will most likely also show up in dmesg when you run your previous code snippet, which would point towards faulty communication between these devices.
Try to disable the IOMMU as described here.
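
You could also check what the driver reports for peer-to-peer access between the two devices (a small sketch; note this only reflects the advertised capability and won't catch silently corrupted transfers):

import torch

# Ask the CUDA driver whether direct peer-to-peer access is available
# in both directions between device 0 and device 1.
print(torch.cuda.can_device_access_peer(0, 1))
print(torch.cuda.can_device_access_peer(1, 0))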

Thanks for pointing me in the right direction. You were right about the IO_PAGE_FAULT messages being shown in dmesg. I was able to fix the problem by following ngimel’s comment in Multi-GPU K80s · Issue #1637 · pytorch/pytorch · GitHub.

In my previous comment, I was wondering whether NCCL is required in general (not only for the tests), since everything seemed to be working without it (besides my issue in this post). Could you please elaborate on how/when NCCL would be beneficial?

NCCL is one communication backend for NVIDIA GPUs, used by torch.distributed for collective operations such as multi-GPU training. PyTorch provides other backends as well (Gloo, MPI), so you could pick one of those instead. A plain tensor.to() copy between devices doesn't go through NCCL, so it isn't required for your use case.
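
For illustration, the backend is selected when the process group is initialized (a single-process sketch with a placeholder address and port):

import torch.distributed as dist

# Pick the communication backend at process-group creation time:
# 'gloo' runs on CPU (and supports some GPU ops), 'nccl' targets NVIDIA GPUs,
# and 'mpi' requires PyTorch built with MPI support.
dist.init_process_group(backend='gloo',
                        init_method='tcp://127.0.0.1:29501',
                        rank=0, world_size=1)
dist.destroy_process_group()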

Good to hear you were able to solve the issue!