How to solve the problem of `RuntimeError: all tensors must be on devices[0]`

Ok, I see.
Could you run the following code with all device ids?

import torch

dev_props = [torch.cuda.get_device_properties(i) for i in device_ids]
print(dev_props)

There are 8 GPUs on the server. I ran:
dev_props = [torch.cuda.get_device_properties(i) for i in range(8)]
print(dev_props)

Here is the result:
[_CudaDeviceProperties(name='GeForce GTX 1080 Ti', major=6, minor=1, total_memory=11172MB, multi_processor_count=28),
_CudaDeviceProperties(name='GeForce GTX 1080 Ti', major=6, minor=1, total_memory=11172MB, multi_processor_count=28),
_CudaDeviceProperties(name='GeForce GTX 1080 Ti', major=6, minor=1, total_memory=11172MB, multi_processor_count=28),
_CudaDeviceProperties(name='GeForce GTX 1080 Ti', major=6, minor=1, total_memory=11172MB, multi_processor_count=28),
_CudaDeviceProperties(name='GeForce GTX 1080 Ti', major=6, minor=1, total_memory=11172MB, multi_processor_count=28),
_CudaDeviceProperties(name='GeForce GTX 1080 Ti', major=6, minor=1, total_memory=11172MB, multi_processor_count=28),
_CudaDeviceProperties(name='GeForce GTX 1080 Ti', major=6, minor=1, total_memory=11172MB, multi_processor_count=28),
_CudaDeviceProperties(name='GeForce GTX 1080 Ti', major=6, minor=1, total_memory=11172MB, multi_processor_count=28)]

That’s really strange, since the warning claims there is an imbalance between your GPUs, although they are all the same.

Let’s ignore the warning for a moment and focus on the error.
Could you try lowering your batch size and check whether your code runs?
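For example, if you build your batches with a DataLoader, halving batch_size there is a quick test (a minimal sketch; the dummy dataset stands in for your real one):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy data only; the point is the smaller batch_size.
dataset = TensorDataset(torch.randn(600, 3, 32, 32), torch.randint(0, 10, (600,)))
loader = DataLoader(dataset, batch_size=30, shuffle=True)  # halved from 60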

I halved the batch size from 60 to 30, and it failed.
Then I halved it again, and it still failed.
Here is the error:
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58

Could you check with nvidia-smi in your terminal whether the GPUs are full? Maybe another process is taking all the memory.
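If you prefer to check from Python, recent PyTorch versions can report the same numbers (a minimal sketch; torch.cuda.mem_get_info wraps cudaMemGetInfo and may not exist in older releases):

import torch

# Print free/total memory for every visible GPU, similar to nvidia-smi.
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"cuda:{i}: {free / 1024**2:.0f} MB free of {total / 1024**2:.0f} MB")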

Could the memory here mean the memory used by the CPU, not the memory on the GPUs?

I don’t think so, since it’s a CUDA error.
Did you check your RAM? The usual error message for running out of RAM would be something like:

RuntimeError: $ Torch: not enough memory: you tried to allocate XXGB. Buy new RAM!

OK, thank you very much.
Yesterday I found that the sum of available RAM and swap was almost 0, and five GPUs were free.
The GPU error I encountered confused me a lot; your help gave me a lot of inspiration.
I will put all this information together to fix it.
Thank you again!

Hello guys :slight_smile:

I actually had the same issue. I was pulling my hair out for a few hours, and it turned out to be a very, very simple problem: the device_ids you give to replicate should be int, not str! Because mine were coming from an argparse parser which I had mistakenly set to cast to string, everything was failing…

I know you guys are probably smart enough not to have this kind of issue, but still, might help someone like me ahah :wink:
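For reference, a minimal sketch of the fix (the argument name and model here are made up; the key part is type=int so the device ids arrive as integers):

import argparse
import torch.nn as nn

parser = argparse.ArgumentParser()
# type=int is the important part: casting to str here reproduces the error above.
parser.add_argument("--device_ids", nargs="+", type=int, default=[0, 1])
args = parser.parse_args()

model = nn.Linear(10, 10).cuda(args.device_ids[0])
model = nn.DataParallel(model, device_ids=args.device_ids)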


I have tested some cases and found that if you use DataParallel to wrap a model, the input tensor can be on the CPU or on any GPU, while the model should be on dev0 (i.e. device_ids[0]). You can call to('cuda:X') before DataParallel or after it, but you need to guarantee that X equals dev0. Below is another example which uses set_device to change the current device and calls model.cuda() after DataParallel.
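A sketch of both options (assuming at least two GPUs; the model is a stand-in):

import torch
import torch.nn as nn

device_ids = [1, 0]  # dev0 is cuda:1 in this example

# Option 1: move the model to device_ids[0] before wrapping.
model = nn.Linear(10, 10).to(f"cuda:{device_ids[0]}")
model = nn.DataParallel(model, device_ids=device_ids)

# Option 2: change the current device, then call .cuda() after wrapping;
# .cuda() with no argument uses the current device, which must be dev0.
torch.cuda.set_device(device_ids[0])
model = nn.DataParallel(nn.Linear(10, 10), device_ids=device_ids)
model.cuda()

# The input batch may live on the CPU or any GPU; DataParallel scatters it.
x = torch.randn(8, 10)
out = model(x)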

This solution did not work for me. :confused:

Maybe you can check your code to make sure that the model is not wrapped in torch.nn.DataParallel() twice.


Thanks very much! This fixed my error.

CUDA_VISIBLE_DEVICES=0,3 effectively defines an array: you put the physical card numbers there, in this case card 0 and card 3.
The gpus config should then be set to [0,1], which indexes into that array, i.e. CUDA_VISIBLE_DEVICES[0] and CUDA_VISIBLE_DEVICES[1], which are again card 0 and card 3.
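A sketch of this mapping (set the variable before CUDA is initialized, e.g. at the very top of the script, or export it in the shell instead):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,3"  # physical cards 0 and 3

import torch
import torch.nn as nn

model = nn.Linear(10, 10).cuda(0)                  # logical 0 = physical card 0
model = nn.DataParallel(model, device_ids=[0, 1])  # logical 0,1 -> physical 0,3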