Tensor totally changes when moving from GPU to GPU

Hi,
I was training a network on a single GPU and everything was fine.
As the GPU utilization was a bit low, I decided to do the preprocessing on a second GPU, allocating the tensors in the dataset's __getitem__ and working on the main thread.
Everything was OK.
Then I realized that when I move my ground-truth tensor from cuda:1 to cuda:0, it turns into a completely different tensor.

(I'm trying to reproduce it, but the console crashed without freeing GPU memory… :confused: )
Any idea?

I don’t have an idea yet, but it sounds like a bug so we would need to dig into it.
Could you explain your use case a bit so that we might try to work on a reproduction as well?

Well, it's nothing out of the ordinary.
I could reproduce something similar just by using the console.

import torch
a = (torch.rand(70, 1, 256, 256) > 0.5).float().cuda(1)
b = a.cuda()  # same effect with a.to('cuda:0')

After restarting the computer and running the code above, b is all zeros.

Originally I discovered the issue in a more complex pipeline involving STFT, functional.grid_sample and einsum, but I think that's irrelevant.

Originally the tensor was stored in a dictionary:

vars['gt'] = vars['gt'].to(torch.device(0))

Afterwards, vars['gt'] became a tensor with values between -1 and 5 on GPU 0.
I even tried

import torch
a = (torch.rand(70, 1, 256, 256) > 0.5).float().cuda(1)
b = a.cuda()  # same effect with a.to('cuda:0')
c = b.cpu()

just to force it onto the CPU (to avoid sync problems), without success. Using torch.cuda.synchronize() between the commands didn't help either.
Torch version is 1.2.0 according to torch.__version__
CUDA version is 10.0.130
Python 3.6.8 (the default version), used through IPython
NVIDIA driver 410.48 (from nvidia-smi)
cuda:0: Quadro P6000
cuda:1: GeForce GTX 1080 Ti
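
For reference, a minimal check along these lines compares the direct device-to-device copy with one staged through the CPU (a hypothetical snippet, not the exact code from my pipeline):

import torch

# Compare the direct cuda:1 -> cuda:0 copy against the same data routed
# through host memory; they should be identical on a healthy setup.
a = (torch.rand(70, 1, 256, 256) > 0.5).float().cuda(1)
direct = a.cuda(0)             # device-to-device copy
via_cpu = a.cpu().cuda(0)      # same data staged through the CPU
torch.cuda.synchronize()
print(torch.equal(direct, via_cpu))  # True on a healthy setup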

Could you update PyTorch to the latest version, please, and retry the code?

If I’m not mistaken, we’ve seen a similar issue some time ago, which boiled down to a hardware issue, but I can’t find the post.
Maybe @albanD remembers it.

I don’t remember exactly but I would bet on hardware issue as well. Can’t find the thread either.

Hi @ptrblck, @albanD
It keeps happening with PyTorch 1.4 and driver 440.
I discovered something interesting.
It does happen going from the GTX 1080 Ti to the P6000, but not the other way around.

import torch
a = (torch.rand(70, 1, 256, 256) > 0.5).float().cuda(1)
b = a.cuda()

So the snippet above fails.

import torch
a = (torch.rand(70, 1, 256, 256) > 0.5).float().cuda()
b = a.cuda(1)

But this one works.
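
To make the direction dependence explicit, a small helper like this (a sketch, not my exact code) can check both orders:

import torch

# Hypothetical helper: copy a random mask from src to dst and compare the
# result against the original data routed through the CPU.
def check_copy(src, dst):
    a = (torch.rand(70, 1, 256, 256) > 0.5).float().to(src)
    b = a.to(dst)
    ok = torch.equal(a.cpu(), b.cpu())
    print(f"{src} -> {dst}: {'ok' if ok else 'corrupted'}")

check_copy('cuda:1', 'cuda:0')  # GTX 1080 Ti -> P6000 (the failing direction)
check_copy('cuda:0', 'cuda:1')  # P6000 -> GTX 1080 Ti (the working direction)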

Could you try to disable P2P access via these env variables and check the behavior?
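
For example, something along these lines (assuming NCCL_P2P_DISABLE is one of the variables in question; exporting it in the shell before starting Python works just as well):

import os

# Has to be set before the communication backend is initialized,
# i.e. before any CUDA peer-to-peer transfer is set up.
os.environ["NCCL_P2P_DISABLE"] = "1"

import torch

# Re-run the failing copy and check whether the result is now intact.
a = (torch.rand(70, 1, 256, 256) > 0.5).float().cuda(1)
b = a.cuda(0)
print(torch.equal(a.cpu(), b.cpu()))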

Hi,
Exactly the same behaviour.

Is there any news about this?

Unfortunately, no updates as I cannot reproduce this issue and haven’t seen it before.
I would recommend:

  • update PyTorch, the NVIDIA driver, etc. to the latest version
  • try different environments, such as the official PyTorch docker container or the NVIDIA NGC one, which uses a newer NCCL version
  • run the PyTorch tests (in particular the CUDA tests) on both GPUs and check if you are seeing any numerical issues (a lightweight sketch of such a check follows below)

If that doesn’t help, I wouldn’t exclude a hardware defect.
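
As a quick first pass before the full test suite, a simple per-GPU numerical check could look like this (just a rough sketch, not a replacement for the tests):

import torch

# Run a basic matmul on each visible GPU and compare it against the CPU
# result; a large difference would point to numerical issues on that device.
x = torch.randn(1024, 1024)
cpu_result = x @ x.t()
for dev in range(torch.cuda.device_count()):
    xg = x.to(f'cuda:{dev}')
    gpu_result = (xg @ xg.t()).cpu()
    max_err = (cpu_result - gpu_result).abs().max().item()
    print(f'cuda:{dev}: max abs difference vs CPU = {max_err:.3e}')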

Hi,
So I was running the tests

-i test_cuda -i test_cuda_primary_ctx -i test_torch -i test_expecttest -i test_foreach

and running them raises this:

Fail to import hypothesis in common_utils, tests are not derandomized

test_foreach raises

AttributeError: module 'torch' has no attribute '_foreach_add'

and test_torch:

  File "test_torch.py", line 29, in <module>
    from torch.testing._internal.common_utils import TestCase, iter_indices, TEST_NUMPY, TEST_SCIPY, \
ImportError: cannot import name 'wrapDeterministicFlagAPITest'

Hi,

This kind of error, AttributeError: module 'torch' has no attribute '_foreach_add', where you're missing a C++ API, usually means that the Python code of your install is not the same version as the binary install.

That happens if you do setup.py develop and then update your local repo.
You might want to clean up your install here.

So if my binary install is 1.6.0, I understand that I should use the tests under the corresponding tag, 1.6.fix4 for example?
There are several 1.6 tags on GitHub.

If you're using a binary install, that means that you either have both a binary and a develop install at the same time that conflict, or you have a folder called torch in your current directory.
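
A quick way to check which installation is actually being picked up:

import torch

# If this path points into a source checkout (or a local ./torch folder)
# instead of site-packages, the two installs are conflicting.
print(torch.__version__)
print(torch.__file__)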

I'm experiencing this now with torch 1.13.1 on a K80. (The newer driver for torch 2 doesn't support the K80; edit: I guess this was a quirk of the image with the newer NCCL, maybe I will try torch 2.)

I have the tests running. It's a little confusing because the PyTorch docker images can come with torch installed via conda, which can clash with a source install.

Ideally there'd be a way to log the CUDA calls and narrow the problem down further.
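
One option might be the PyTorch profiler, which at least records the kernels and memcpy operations involved (just a sketch; I haven't confirmed it reveals anything useful for this issue):

import torch
from torch.profiler import profile, ProfilerActivity

# Profile a single cross-device copy to see which CUDA activities it triggers.
a = (torch.rand(70, 1, 256, 256) > 0.5).float().cuda(0)
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    b = a.cuda(1)
    torch.cuda.synchronize()
print(prof.key_averages().table(sort_by="cuda_time_total"))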

I have another K80 I can swap in to try, too, although they’ve both weathered the same storms and could be broken similarly if there’s damage.

I encountered a similar issue when I was using a machine with multiple RTX4090 GPUs.
It can be reproduced simply by running this snippet on the console:

>>> l = torch.tensor(1, device='cuda:0')
>>> l.to('cuda:1')
tensor(0, device='cuda:1')

It happens with both PyTorch 2.0.0 and 1.10.0 on two RTX 4090s, but it does NOT happen on two RTX 3090 GPUs (PyTorch 1.10.0).

The issue can temporarily be worked around by moving to the CPU first and then to the second GPU:

>>> l = torch.tensor(1, device='cuda:0')
>>> l.to('cpu').to('cuda:1')
tensor(1, device='cuda:1')
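
A related check (I'm not sure whether it is relevant here) is whether PyTorch reports peer-to-peer access between the two devices:

import torch

# Both calls return a bool indicating whether peer access is possible.
print(torch.cuda.can_device_access_peer(0, 1))
print(torch.cuda.can_device_access_peer(1, 0))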

Any thoughts?

I'm afraid I couldn't find a solution. I think I did some experiments and it seemed machine-related, but I can't really be sure.

Best
Juan

Disable P2P access between the 4090s via NCCL_P2P_DISABLE=1 or update to the latest driver.

Thanks. I tried NCCL_P2P_DISABLE=1 but it doesn't help. I don't have access to update the driver, so I cannot try that either.