Ok, I see.
Could you run the following code with all device ids?
dev_props = [torch.cuda.get_device_properties(i) for i in device_ids]
print(dev_props)
There are 8 GPUs on the server. I ran:
dev_props = [torch.cuda.get_device_properties(i) for i in range(8)]
print(dev_props)
Here is the result:
[_CudaDeviceProperties(name='GeForce GTX 1080 Ti', major=6, minor=1, total_memory=11172MB, multi_processor_count=28),
_CudaDeviceProperties(name='GeForce GTX 1080 Ti', major=6, minor=1, total_memory=11172MB, multi_processor_count=28),
_CudaDeviceProperties(name='GeForce GTX 1080 Ti', major=6, minor=1, total_memory=11172MB, multi_processor_count=28),
_CudaDeviceProperties(name='GeForce GTX 1080 Ti', major=6, minor=1, total_memory=11172MB, multi_processor_count=28),
_CudaDeviceProperties(name='GeForce GTX 1080 Ti', major=6, minor=1, total_memory=11172MB, multi_processor_count=28),
_CudaDeviceProperties(name='GeForce GTX 1080 Ti', major=6, minor=1, total_memory=11172MB, multi_processor_count=28),
_CudaDeviceProperties(name='GeForce GTX 1080 Ti', major=6, minor=1, total_memory=11172MB, multi_processor_count=28),
_CudaDeviceProperties(name='GeForce GTX 1080 Ti', major=6, minor=1, total_memory=11172MB, multi_processor_count=28)]
That’s really strange, since the warning claims there is an imbalance between your GPUs, although they are all the same.
Let’s ignore the warning for a moment and focus on the error.
Could you try to lower your batch size and check if your code runs?
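For reference, a minimal sketch of where the batch size is usually set (the dataset and loader below are placeholders, not taken from your code):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset, just to show where batch_size lives.
dataset = TensorDataset(torch.randn(1000, 3, 32, 32), torch.randint(0, 10, (1000,)))

# Keep halving batch_size until the OOM disappears, then work back up.
loader = DataLoader(dataset, batch_size=30, shuffle=True)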
I halved the batch size from 60 to 30, and it failed.
Then I lowered it again, and it failed too.
Here is the error:
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58
Could you check in your terminal with nvidia-smi
whether the GPUs are already full? Maybe another process is taking all the memory.
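If you prefer checking from Python rather than the terminal, something like this works (a minimal sketch; note it only reports memory used by the current process, while nvidia-smi shows all processes, and torch.cuda.memory_reserved is called memory_cached in older PyTorch versions):

import torch

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    allocated = torch.cuda.memory_allocated(i) / 1024**2
    reserved = torch.cuda.memory_reserved(i) / 1024**2
    # Compare against the total to see how close this process is to the limit.
    print(f"cuda:{i} ({props.name}): {allocated:.0f} MB allocated, "
          f"{reserved:.0f} MB reserved, {props.total_memory / 1024**2:.0f} MB total")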
Could the memory here mean the memory used by the CPU (RAM), not the memory on the GPU?
I don’t think so, since it’s a CUDA error.
Did you check your RAM? The usual error message for the RAM would be something like:
RuntimeError: $ Torch: not enough memory: you tried to allocate XXGB. Buy new RAM!
OK. Thank you very much.
Yesterday I found that the sum of available RAM and swap was almost 0, while five GPUs were free.
The GPU error I encountered confused me a lot, and your help gave me a lot of inspiration.
I will put all this information together to fix it.
Thank you again!
Hello guys,
I actually had the same issue. I was pulling my hair out for a few hours, and it turned out to be a very, very simple problem: the device_ids you give to replicate should be int, not str! Mine were coming from an argparse parser that I had mistakenly set to cast to string, so everything was failing…
I know you guys are probably smart enough not to have this kind of issue, but it might still help someone like me, ahah.
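In case it helps, a minimal sketch of the fix (the --device-ids argument name and the model are just examples, not from my original code):

import argparse
import torch.nn as nn

parser = argparse.ArgumentParser()
# The crucial part: type=int, so DataParallel never receives strings like "0".
parser.add_argument("--device-ids", type=int, nargs="+", default=[0, 1])
args = parser.parse_args()

model = nn.Linear(10, 2)  # placeholder model
model = model.cuda(args.device_ids[0])
model = nn.DataParallel(model, device_ids=args.device_ids)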
I have tested some cases and found that if you use DataParallel to wrap a model, the input tensor can be either on the CPU or on any GPU, while the model itself should be on dev0 (the first device in device_ids). You can call to('cuda:x') before or after DataParallel, but you need to guarantee that x equals dev0. Below is another example, which uses set_device to change the current device and calls model.cuda() after DataParallel.
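A minimal sketch of that second pattern (the model is a placeholder, and device_ids = [2, 3] is just an example; dev0 is device_ids[0]):

import torch
import torch.nn as nn

device_ids = [2, 3]
torch.cuda.set_device(device_ids[0])   # make dev0 the current device

model = nn.Linear(10, 2)               # placeholder model
model = nn.DataParallel(model, device_ids=device_ids)
model.cuda()                           # .cuda() with no argument moves the parameters to the current device (dev0)

x = torch.randn(8, 10)                 # input can stay on the CPU; DataParallel scatters it
out = model(x)                         # outputs are gathered back on dev0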
This solution did not work for me.
Maybe you can check the code to make sure that there are no duplicate torch.nn.DataParallel() calls.
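For example, a small guard like this avoids wrapping twice (a sketch; the model and device ids are placeholders):

import torch.nn as nn

model = nn.Linear(10, 2)     # placeholder model
device_ids = [0, 1]          # example device ids

# A second nn.DataParallel around an already wrapped model is a common source of device errors.
if not isinstance(model, nn.DataParallel):
    model = nn.DataParallel(model, device_ids=device_ids)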
Thanks very much! This fixed my error.
CUDA_VISIBLE_DEVICES=0,3 effectively acts like an array: you put the physical card numbers there, in this case card 0 and card 3.
The GPUs config should then be set to [0,1], which means taking the cards at indices 0 and 1 of that array, i.e. CUDA_VISIBLE_DEVICES[0] and CUDA_VISIBLE_DEVICES[1], which are card 0 and card 3.
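A minimal sketch of that mapping (assuming a machine with at least four physical GPUs; the model is a placeholder):

# Launch with: CUDA_VISIBLE_DEVICES=0,3 python train.py
import torch
import torch.nn as nn

print(torch.cuda.device_count())                   # prints 2: cuda:0 is physical card 0, cuda:1 is physical card 3

model = nn.Linear(10, 2).cuda(0)                   # placeholder model on cuda:0 (physical card 0)
model = nn.DataParallel(model, device_ids=[0, 1])  # indices into the visible devices, i.e. physical cards 0 and 3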