Tensor totally changes when moving from GPU to GPU

Hi,
I was training a network on a single GPU, and everything was fine.
As the GPU utilization was a bit low, I decided to do the preprocessing on a second GPU, allocating tensors in the dataset's __getitem__ and working on the main thread.
Everything was OK.
Then I realized that when I move my ground truth from cuda:1 to cuda:0, the tensor changes into a completely different one.
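Roughly, the setup looked like this (a minimal sketch; the class name and the preprocessing body are placeholders, not my actual code):

import torch
from torch.utils.data import Dataset, DataLoader

class PreprocessedDataset(Dataset):
    def __init__(self, raw_data):
        self.raw_data = raw_data

    def __len__(self):
        return len(self.raw_data)

    def __getitem__(self, idx):
        x = self.raw_data[idx].cuda(1)  # allocate on the second GPU
        return x * 2.0                  # stand-in for the real preprocessing

# num_workers=0 keeps everything on the main thread, as described
loader = DataLoader(PreprocessedDataset([torch.rand(1, 256, 256) for _ in range(8)]),
                    batch_size=4, num_workers=0)
for gt in loader:
    gt = gt.to('cuda:0')  # the move where the corruption shows up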

(I'm trying to reproduce it, but the console crashed without freeing GPU memory… :confused:)
Any idea?

I don’t have an idea yet, but it sounds like a bug, so we would need to dig into it.
Could you explain your use case a bit so that we might try to work on a reproduction as well?

Well, it's nothing out of the ordinary.
I could reproduce something similar just by using the console.

import torch

a = (torch.rand(70, 1, 256, 256) > 0.5).float().cuda(1)  # binary mask on cuda:1
b = a.cuda()  # move to cuda:0; same effect with a.to('cuda:0')

After restarting the computer and running the aforementioned code, b is all zeros.
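The corruption is easy to see by comparing the two copies, along these lines:

print(a.sum().item())                 # number of ones in the mask on cuda:1
print(b.sum().item())                 # reported as 0.0 after the bad copy
print(torch.equal(a.cpu(), b.cpu()))  # False when the copy is corrupted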

Originally I discovered the issue in a complex pipeline which involves STFT, functional.grid_sample and einsum, but I think that's irrelevant.

Originally the tensor was stored in a dictionary:

vars['gt'] = vars['gt'].to(torch.device(0))

Afterwards, vars['gt'] became a tensor bounded between -1 and 5 on GPU 0.
I even tried

import torch

a = (torch.rand(70, 1, 256, 256) > 0.5).float().cuda(1)
b = a.cuda()  # same effect with a.to('cuda:0')
c = b.cpu()

just to force it onto the CPU (to avoid sync problems). Using torch.cuda.synchronize() between commands didn't help either.
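For reference, the synchronize variant looked roughly like this (a sketch of what was tried):

import torch

a = (torch.rand(70, 1, 256, 256) > 0.5).float().cuda(1)
torch.cuda.synchronize()
b = a.cuda()
torch.cuda.synchronize()
c = b.cpu()  # still corrupted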
  • PyTorch 1.2.0, according to torch.__version__
  • CUDA 10.0.130
  • Python 3.6.8 (default), run from IPython
  • NVIDIA driver 410.48 (nvidia-smi)
  • cuda:0: Quadro P6000
  • cuda:1: GeForce GTX 1080 Ti
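For completeness, all of this can be read from a Python session with standard attributes:

import torch

print(torch.__version__)              # 1.2.0
print(torch.version.cuda)             # 10.0.130
print(torch.cuda.get_device_name(0))  # Quadro P6000
print(torch.cuda.get_device_name(1))  # GeForce GTX 1080 Ti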

Could you update PyTorch to the latest version, please, and retry the code?

If I’m not mistaken, we’ve seen a similar issue some time ago, which boiled down to a hardware issue, but I can’t find the post.
Maybe @albanD remembers it.

I don't remember exactly, but I would bet on a hardware issue as well. Can't find the thread either.

Hi @ptrblck, @albanD,
It keeps happening with PyTorch 1.4 and driver 440.
I discovered something interesting:
it happens going from the GTX 1080 Ti to the P6000, but not the other way around.

So this fails:

import torch

a = (torch.rand(70, 1, 256, 256) > 0.5).float().cuda(1)  # on cuda:1 (GTX 1080 Ti)
b = a.cuda()  # copy to cuda:0 (P6000): corrupted

but this works:

import torch

a = (torch.rand(70, 1, 256, 256) > 0.5).float().cuda()  # on cuda:0 (P6000)
b = a.cuda(1)  # copy to cuda:1 (GTX 1080 Ti): fine
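A quick way to check both directions at once (a small sketch to verify the asymmetry):

import torch

src = (torch.rand(70, 1, 256, 256) > 0.5).float()
for s, d in [('cuda:1', 'cuda:0'), ('cuda:0', 'cuda:1')]:
    a = src.to(s)
    b = a.to(d)
    ok = torch.equal(src, b.cpu())
    print(s, '->', d, 'ok' if ok else 'corrupted')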

Could you try to disable P2P access via these env variables and check the behavior?
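The linked variables aren't quoted here; assuming the NCCL P2P ones such as NCCL_P2P_DISABLE are meant (an assumption on my part), they need to be set before CUDA is initialized, i.e. before importing torch:

import os
os.environ['NCCL_P2P_DISABLE'] = '1'  # assumption: one of the variables referred to above

import torch
a = (torch.rand(70, 1, 256, 256) > 0.5).float().cuda(1)
b = a.cuda()
print(torch.equal(a.cpu(), b.cpu()))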

Hi,
Exactly the same behaviour.

Is there any news about this?

Unfortunately no updates, as I cannot reproduce this issue and haven't seen it before.
I would recommend:

  • update PyTorch, the NVIDIA driver, etc. to the latest version
  • try different environments, such as the official PyTorch docker container or the NVIDIA NGC one, which uses a newer NCCL version
  • run the PyTorch tests (in particular the CUDA tests) on both GPUs and check if you are seeing some numerical issues (see the sketch below)

If that doesn’t help, I wouldn’t exclude a hardware defect.
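To run the CUDA tests per GPU from a source checkout, something like this should work; pinning the visible device via CUDA_VISIBLE_DEVICES is my assumption about how to isolate each GPU:

import os
import subprocess

# run the CUDA test suite once per GPU, from the root of a PyTorch checkout
for gpu in ('0', '1'):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu)
    subprocess.run(['python', 'test/run_test.py', '-i', 'test_cuda'], env=env)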

Hi,
So I was running the tests

-i test_cuda -i test_cuda_primary_ctx -i test_torch -i test_expecttest -i test_foreach

and they raise this:

Fail to import hypothesis in common_utils, tests are not derandomized

test_foreach raises:

AttributeError: module 'torch' has no attribute '_foreach_add'

and test_torch raises:

  File "test_torch.py", line 29, in <module>
    from torch.testing._internal.common_utils import TestCase, iter_indices, TEST_NUMPY, TEST_SCIPY, \
ImportError: cannot import name 'wrapDeterministicFlagAPITest'

Hi,

This kind of error, AttributeError: module 'torch' has no attribute '_foreach_add', where a C++ API is missing, usually means that the Python code of your install is not the same version as the binary install.

That happens if you do setup.py develop and then update your local repo.
You might want to clean up your install here.

So if my binary install is 1.6.0,
I understand that I should use the tests under the tag
1.6.fix4, for example?
There are several 1.6 tags on GitHub.

If you're using a binary install, that means you either have both a binary and a develop install at the same time and they conflict, or you have a folder called torch in your current directory.
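A quick way to see which install is actually being imported, using plain introspection:

import torch

# if this prints a path inside your source checkout or current directory
# instead of site-packages, the develop install (or a stray torch/ folder)
# is shadowing the binary one
print(torch.__version__)
print(torch.__file__)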