Hi,
I was training a network on a single GPU and everything was fine.
Since GPU utilization was a bit low, I decided to do the preprocessing on a second GPU, allocating tensors in the dataset's __getitem__ and working on the main thread.
Everything was still OK.
Then I realized that when I move my ground truth from cuda:1 to cuda:0, the tensor changes into a completely different one.
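Roughly, the setup looks like this (a minimal sketch with made-up names, not my actual code):
import torch
from torch.utils.data import Dataset, DataLoader

class PreprocOnGpu1(Dataset):
    # preprocessing happens directly on cuda:1
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        img = torch.rand(1, 256, 256, device='cuda:1')
        gt = (torch.rand(1, 256, 256, device='cuda:1') > 0.5).float()
        return img, gt

loader = DataLoader(PreprocOnGpu1(), batch_size=4, num_workers=0)  # main thread

for img, gt in loader:
    img = img.to('cuda:0')
    gt = gt.to('cuda:0')   # here the values come out completely different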
(I'm trying to reproduce it, but the console crashed without freeing GPU memory… )
Any idea?
I don’t have an idea yet, but it sounds like a bug so we would need to dig into it.
Could you explain your use case a bit so that we might try to work on a reproduction as well?
Well, it's nothing out of the ordinary.
I could reproduce something similar just by using the console.
import torch
a = (torch.rand(70, 1, 256, 256) > 0.5).float().cuda(1)
b = a.cuda()  # Same effect with a.to('cuda:0')
After restarting the computer and running the code above, b is all zeros.
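For what it's worth, comparing both tensors through the CPU shows the mismatch directly (a quick sketch):
import torch

a = (torch.rand(70, 1, 256, 256) > 0.5).float().cuda(1)
b = a.cuda()                           # cuda:1 -> cuda:0

print(torch.equal(a.cpu(), b.cpu()))   # False when the copy is broken
print(b.abs().max().item())            # 0.0, i.e. b is all zeros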
I originally discovered the issue in a complex pipeline which involves STFT, functional.grid_sample and einsum, but I think that's irrelevant.
Originally the tensor was stored in a dictionary:
vars['gt'] = vars['gt'].to(torch.device(0))
Afterwards vars['gt'] became a tensor bounded between -1 and 5 on GPU 0.
I even tried
import torch
a = (torch.rand(70, 1, 256, 256) > 0.5).float().cuda(1)
b = a.cuda()  # Same effect with a.to('cuda:0')
c = b.cpu()
just to force it onto the CPU (to avoid sync problems), but that didn't help. Neither did using torch.cuda.synchronize() between commands.
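That is, roughly something like this (a sketch of what I mean):
import torch

a = (torch.rand(70, 1, 256, 256) > 0.5).float().cuda(1)
with torch.cuda.device(1):
    torch.cuda.synchronize()   # wait for the work on cuda:1 to finish
b = a.cuda()                   # cuda:1 -> cuda:0
with torch.cuda.device(0):
    torch.cuda.synchronize()   # wait for the copy on cuda:0 to finish
c = b.cpu()                    # still comes back wrong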
Torch version is 1.2.0 according to torch.__version__
CUDA version is 10.0.130
IPython on Python 3.6.8, the default version
NVIDIA driver 410.48 (from nvidia-smi)
cuda:0: Quadro P6000
cuda:1: GeForce GTX 1080 Ti
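For reference, these can also be queried from Python (a quick sketch):
import torch

print(torch.__version__)               # 1.2.0
print(torch.version.cuda)              # 10.0.130
print(torch.cuda.get_device_name(0))   # Quadro P6000
print(torch.cuda.get_device_name(1))   # GeForce GTX 1080 Ti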
Could you update PyTorch to the latest version, please, and retry the code?
If I’m not mistaken, we’ve seen a similar issue some time ago, which boiled down to a hardware issue, but I can’t find the post.
Maybe @albanD remembers it.
Hi @ptrblck, @albanD
It keeps happening with PyTorch 1.4 and driver 440.
I discovered something interesting.
It does happen going from the GTX 1080 Ti to the P6000, but not the other way around.
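Roughly what I mean, with cuda:0 = P6000 and cuda:1 = GTX 1080 Ti (a sketch):
import torch

a = (torch.rand(70, 1, 256, 256) > 0.5).float()

bad = a.cuda(1).to('cuda:0')        # GTX 1080 Ti -> P6000
print(torch.equal(a, bad.cpu()))    # False, comes back all zeros

good = a.cuda(0).to('cuda:1')       # P6000 -> GTX 1080 Ti
print(torch.equal(a, good.cpu()))   # True, values survive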
Fail to import hypothesis in common_utils, tests are not derandomized
test_foreach raises
AttributeError: module 'torch' has no attribute '_foreach_add'
and test_torch raises
File "test_torch.py", line 29, in <module>
from torch.testing._internal.common_utils import TestCase, iter_indices, TEST_NUMPY, TEST_SCIPY, \
ImportError: cannot import name 'wrapDeterministicFlagAPITest'
This kind of error (AttributeError: module 'torch' has no attribute '_foreach_add', where you're missing a C++ API) usually means that the Python code of your install is not the same version as the binary install.
That happens if you do setup.py develop and then update your local repo.
You might want to clean up your install here.
If you're using a binary install, it means that you either have both a binary and a develop install at the same time that conflict, or you have a folder called torch in your current directory.
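A quick way to see which install actually gets picked up (a sketch):
import torch

print(torch.__file__)             # site-packages -> binary install, source checkout -> develop install
print(torch.__version__)
print(torch.version.git_version)  # commit the build came from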
I'm experiencing this now on torch 1.13.1 on a K80. (The newer driver for torch 2 doesn't support the K80 (edit: I guess this was a quirk of the image with a newer NCCL; maybe I will try torch 2).)
I have the tests running. It's a little confusing because PyTorch Docker images can come with torch installed via conda, which can clash with a source install.
It seems like ideally there'd be a way to log the CUDA calls and narrow the problem down further.
I have another K80 I can swap in to try, too, although they’ve both weathered the same storms and could be broken similarly if there’s damage.
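In the meantime, a minimal round-trip check like this (just a sketch, assuming two visible devices) at least tells me whether a plain cross-device copy and the reported peer access look sane:
import torch

print(torch.cuda.can_device_access_peer(0, 1))   # is P2P access reported between the two cards?

x = torch.arange(1024, device='cuda:0')
y = x.to('cuda:1')
print(torch.equal(x.cpu(), y.cpu()))              # True on a healthy setup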
I encountered a similar issue when I was using a machine with multiple RTX4090 GPUs.
It can be reproduced simply by running this snippet on the console:
>>> import torch
>>> l = torch.tensor(1, device='cuda:0')
>>> l.to('cuda:1')
tensor(0, device='cuda:1')
It happens with both PyTorch 2.0.0 and 1.10.0 on two RTX 4090s, but it does NOT happen on two RTX 3090 GPUs (PyTorch 1.10.0).
The issue can temporarily be solved by moving to CPU first and then to the second GPU:
>>> l = torch.tensor(1, device='cuda:0')
>>> l.to('cpu').to('cuda:1')
tensor(1, device='cuda:1')
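If it helps, the workaround can be wrapped in a tiny helper (hypothetical name, just a sketch):
import torch

def to_via_cpu(t, device):
    # Hop through the CPU to avoid the broken direct GPU-to-GPU copy.
    return t.to('cpu').to(device)

l = torch.tensor(1, device='cuda:0')
print(to_via_cpu(l, 'cuda:1'))   # tensor(1, device='cuda:1')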