How to solve the problem of `RuntimeError: all tensors must be on devices[0]`

Is someone else using the GPU or does the server have different cards installed?
Could you explain a bit about your setup?

Might be unrelated here, but sometimes the device ids don’t match what nvidia-smi reports.

I was watching nvidia-smi every 3 seconds. It seems no one was using GPUs 0 and 3.
The server has 8 GPUs installed and they are all the same type, a 1080 Ti.
What else do you want to know?

Could you add CUDA_DEVICE_ORDER=PCI_BUS_ID in front of your Python call?
Just curious whether the device ids are assigned differently.
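
For example: CUDA_DEVICE_ORDER=PCI_BUS_ID python main.py. You can also set it inside the script, as a sketch (it has to happen before torch initializes CUDA):

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"  # order devices by PCI bus id, as nvidia-smi does
import torch  # import only after setting the variable
print(torch.cuda.device_count())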

Something doesn’t seem right.
The _check_balance method only checks the GPU specs (total memory and multi-processor count).
So if all cards are 1080 Tis, the warning shouldn’t appear.
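
For reference, a rough standalone version of that check looks something like this (not the exact DataParallel code, just a sketch with an approximate threshold):

import torch

device_ids = list(range(torch.cuda.device_count()))
props = [torch.cuda.get_device_properties(i) for i in device_ids]

def has_imbalance(values, threshold=0.75):
    # roughly what the balance check does: flag if the smallest value
    # is well below the largest one
    return min(values) / max(values) < threshold

if has_imbalance([p.total_memory for p in props]) or has_imbalance([p.multi_processor_count for p in props]):
    print("There is an imbalance between your GPUs.")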

In your first attempt to use DataParallel you used device_ids=[0, 3] and got “Invalid device id” back.

Are you sure you are working on the right server?

What do you mean by the right server? I mean that if I weren’t on the right server, I shouldn’t be able to sign in successfully.
And I don’t know what the error and the warning mean.
By the way, I set the input like this:
calculator = torch.device('cuda')
input_var = data.to(calculator)

I ran a test using the official examples/mnist.
I changed the code this way:

From:
model = Net().to(device)
To:
model = Net()
model = nn.DataParallel(model, device_ids=[0, 1]).to(device)

Then ran: CUDA_VISIBLE_DEVICES=0,1 python main.py
The same warning appeared.
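
The full pattern looks roughly like this (a sketch; Net and train_loader come from examples/mnist):

import torch
import torch.nn as nn

device = torch.device('cuda:0')  # this is device_ids[0] for DataParallel

model = Net()  # Net as defined in examples/mnist
model = nn.DataParallel(model, device_ids=[0, 1]).to(device)

for data, target in train_loader:  # train_loader as in examples/mnist
    data, target = data.to(device), target.to(device)
    output = model(data)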

Ok, I see.
Could you run the following code with all device ids?

dev_props = [torch.cuda.get_device_properties(i) for i in device_ids]
print(dev_props)

There are 8 GPUs on the server. I ran:
dev_props = [torch.cuda.get_device_properties(i) for i in range(8)]
print(dev_props)

Here is result:
[_CudaDeviceProperties(name='GeForce GTX 1080 Ti', major=6, minor=1, total_memory=11172MB, multi_processor_count=28),
_CudaDeviceProperties(name='GeForce GTX 1080 Ti', major=6, minor=1, total_memory=11172MB, multi_processor_count=28),
_CudaDeviceProperties(name='GeForce GTX 1080 Ti', major=6, minor=1, total_memory=11172MB, multi_processor_count=28),
_CudaDeviceProperties(name='GeForce GTX 1080 Ti', major=6, minor=1, total_memory=11172MB, multi_processor_count=28),
_CudaDeviceProperties(name='GeForce GTX 1080 Ti', major=6, minor=1, total_memory=11172MB, multi_processor_count=28),
_CudaDeviceProperties(name='GeForce GTX 1080 Ti', major=6, minor=1, total_memory=11172MB, multi_processor_count=28),
_CudaDeviceProperties(name='GeForce GTX 1080 Ti', major=6, minor=1, total_memory=11172MB, multi_processor_count=28),
_CudaDeviceProperties(name='GeForce GTX 1080 Ti', major=6, minor=1, total_memory=11172MB, multi_processor_count=28)]

That’s really strange, since the warning claims there is an imbalance between your GPUs, although they are all the same.

Let’s ignore the warning for a moment and focus on the error.
Could you try to lower your batch size and check if your code runs?

I halved the batch size from 60 to 30, and it failed.
Then I halved it again, and it failed too.
Here is the error:
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58

Could you check in your terminal with nvidia-smi if the GPU is full? Maybe another process is taking all the memory.
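
If it is easier, you can also check from inside Python what PyTorch itself has allocated per device (just a sketch; memory held by other processes will only show up in nvidia-smi):

import torch

for i in range(torch.cuda.device_count()):
    mb = torch.cuda.memory_allocated(i) / 1024**2  # memory allocated by this process, in MB
    print('cuda:%d: %.1f MB allocated' % (i, mb))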

Could the memory here mean the memory used by the CPU, not the GPU?

I don’t think so, since it’s a CUDA error.
Did you check your RAM? The usual error message for the RAM would be something like:

RuntimeError: $ Torch: not enough memory: you tried to allocate XXGB. Buy new RAM!

OK. Thank you very much.
Yesterday, I found that the sum of available RAM and swap was almost 0, and five GPUs were free.
The GPU error I encountered confused me a lot. Your help gave me a lot of inspiration.
I will put all the info together to fix it.
Thank you again!

Hello guys :slight_smile:

I actually had the same issue. I was pulling my hair out for a few hours, and it turned out to be a very, very simple issue: the device_ids you give to replicate should be ints, not strs! Because mine were coming from an argparse parser which I had mistakenly set to cast to string, everything was failing…

I know you guys are probably smart enough not to have this kind of issue, but still, it might help someone like me ahah :wink:
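
In case it helps, the fix on my side was roughly this (a sketch; the argument name is just an example):

import argparse
import torch.nn as nn

parser = argparse.ArgumentParser()
# type=int is the important part: DataParallel expects int device ids, not strings
parser.add_argument('--device-ids', nargs='+', type=int, default=[0, 1])
args = parser.parse_args()

model = nn.Linear(10, 10).cuda(args.device_ids[0])  # parameters go on device_ids[0]
model = nn.DataParallel(model, device_ids=args.device_ids)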


I have tested some cases and found that if you use DataParallel to wrap a model, the input tensor can be either on the CPU or on any GPU, while the model should be on dev0. You can call to('cuda:x') before DataParallel or after it, but you need to guarantee that x equals dev0. Here is another example which uses set_device to change the current device and calls model.cuda() after DataParallel.
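
A sketch of what I mean (assuming at least two visible GPUs):

import torch
import torch.nn as nn

torch.cuda.set_device(0)  # make cuda:0 (dev0) the current device

model = nn.Linear(10, 10)
model = nn.DataParallel(model, device_ids=[0, 1])
model.cuda()  # .cuda() without an argument uses the current device, i.e. device_ids[0]

x = torch.randn(8, 10)  # the input can stay on the CPU; DataParallel scatters it
out = model(x)
print(out.device)  # outputs are gathered back on cuda:0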

This solution did not work for me. :confused:

Maybe you can check the code to make sure that the model is not wrapped in torch.nn.DataParallel() more than once.


Thanks very much! This fixed my error.

CUDA_VISIBLE_DEVICES=0,3 acts like an array: you put the actual card numbers in it, so here it holds card 0 and card 3.
The gpus config should then be set to [0, 1], which means taking the cards at indices 0 and 1 of that array, i.e. CUDA_VISIBLE_DEVICES[0] and CUDA_VISIBLE_DEVICES[1], which are card 0 and card 3.
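
In other words, something like this sketch (assuming the free physical cards are 0 and 3):

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,3'  # physical cards 0 and 3 become the only visible devices
import torch
import torch.nn as nn

# inside this process the visible cards are re-indexed as cuda:0 and cuda:1
model = nn.Linear(10, 10).to('cuda:0')
model = nn.DataParallel(model, device_ids=[0, 1])  # indices into CUDA_VISIBLE_DEVICES, i.e. cards 0 and 3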