Ok, I see.
Could you run the following code with all device ids?
dev_props = [torch.cuda.get_device_properties(i) for i in device_ids]
print(dev_props)
There are 8 GPUs on the server. I ran:
dev_props = [torch.cuda.get_device_properties(i) for i in range(8)]
print(dev_props)
Here is the result:
[_CudaDeviceProperties(name='GeForce GTX 1080 Ti', major=6, minor=1, total_memory=11172MB, multi_processor_count=28),
_CudaDeviceProperties(name='GeForce GTX 1080 Ti', major=6, minor=1, total_memory=11172MB, multi_processor_count=28),
_CudaDeviceProperties(name='GeForce GTX 1080 Ti', major=6, minor=1, total_memory=11172MB, multi_processor_count=28),
_CudaDeviceProperties(name='GeForce GTX 1080 Ti', major=6, minor=1, total_memory=11172MB, multi_processor_count=28),
_CudaDeviceProperties(name='GeForce GTX 1080 Ti', major=6, minor=1, total_memory=11172MB, multi_processor_count=28),
_CudaDeviceProperties(name='GeForce GTX 1080 Ti', major=6, minor=1, total_memory=11172MB, multi_processor_count=28),
_CudaDeviceProperties(name='GeForce GTX 1080 Ti', major=6, minor=1, total_memory=11172MB, multi_processor_count=28),
_CudaDeviceProperties(name='GeForce GTX 1080 Ti', major=6, minor=1, total_memory=11172MB, multi_processor_count=28)]
That’s really strange, since the warning claims there is an imbalance between your GPUs, although they are all the same.
Let’s ignore the warning for a moment and focus on the error.
Could you try to lower your batch size and check if your code runs?
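For reference, a minimal sketch of where the batch size is usually set (the dataset and loader below are placeholders, not taken from your code):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset, just to show where batch_size lives.
dataset = TensorDataset(torch.randn(1000, 3, 32, 32), torch.randint(0, 10, (1000,)))

# Keep halving batch_size until the OOM disappears, then work back up.
loader = DataLoader(dataset, batch_size=30, shuffle=True)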
I halved the batch size from 60 to 30, and it failed.
Then I lowered it again, and it failed too.
Here is the error:
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58
Could you check in your terminal with nvidia-smi
whether the GPUs are already full? Maybe another process is taking all the memory.
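If you prefer checking from Python rather than the terminal, something like this works (a minimal sketch; note it only reports memory used by the current process, while nvidia-smi shows all processes, and torch.cuda.memory_reserved is called memory_cached in older PyTorch versions):

import torch

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    allocated = torch.cuda.memory_allocated(i) / 1024**2
    reserved = torch.cuda.memory_reserved(i) / 1024**2
    # Compare against the total to see how close this process is to the limit.
    print(f"cuda:{i} ({props.name}): {allocated:.0f} MB allocated, "
          f"{reserved:.0f} MB reserved, {props.total_memory / 1024**2:.0f} MB total")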
Could the memory here mean the memory used by the CPU (RAM), not the memory on the GPU?
I don’t think so, since it’s a CUDA error.
Did you check your RAM? The usual error message for the RAM would be something like:
RuntimeError: $ Torch: not enough memory: you tried to allocate XXGB. Buy new RAM!
OK. Thank you very much.
Yesterday I found that the sum of available RAM and swap was almost 0, while five GPUs were free.
The GPU error I encountered confused me a lot, and your help gave me a lot of inspiration.
I will put all this information together to fix it.
Thank you again!
Hello guys,
I actually had the same issue. I was pulling my hair out for a few hours, and it turned out to be a very, very simple problem: the device_ids you give to replicate should be int, not str! Mine were coming from an argparse parser that I had mistakenly set to cast to string, so everything was failing…
I know you guys are probably smart enough not to have this kind of issue, but it might still help someone like me, ahah.
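In case it helps, a minimal sketch of the fix (the --device-ids argument name and the model are just examples, not from my original code):

import argparse
import torch.nn as nn

parser = argparse.ArgumentParser()
# The crucial part: type=int, so DataParallel never receives strings like "0".
parser.add_argument("--device-ids", type=int, nargs="+", default=[0, 1])
args = parser.parse_args()

model = nn.Linear(10, 2)  # placeholder model
model = model.cuda(args.device_ids[0])
model = nn.DataParallel(model, device_ids=args.device_ids)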
I have tested some cases and found that if you use DataParallel to wrap a model, the input tensor can be either on the CPU or on any GPU, while the model itself should be on dev0 (the first device in device_ids). You can call to('cuda:x') before or after DataParallel, but you need to guarantee that x equals dev0. Below is another example, which uses set_device to change the current device and calls model.cuda() after DataParallel.
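A minimal sketch of that second pattern (the model is a placeholder, and device_ids = [2, 3] is just an example; dev0 is device_ids[0]):

import torch
import torch.nn as nn

device_ids = [2, 3]
torch.cuda.set_device(device_ids[0])   # make dev0 the current device

model = nn.Linear(10, 2)               # placeholder model
model = nn.DataParallel(model, device_ids=device_ids)
model.cuda()                           # .cuda() with no argument moves the parameters to the current device (dev0)

x = torch.randn(8, 10)                 # input can stay on the CPU; DataParallel scatters it
out = model(x)                         # outputs are gathered back on dev0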
This solution did not work for me.
Maybe you can check the code to make sure that there are no duplicate torch.nn.DataParallel() calls.
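For example, a small guard like this avoids wrapping twice (a sketch; the model and device ids are placeholders):

import torch.nn as nn

model = nn.Linear(10, 2)     # placeholder model
device_ids = [0, 1]          # example device ids

# A second nn.DataParallel around an already wrapped model is a common source of device errors.
if not isinstance(model, nn.DataParallel):
    model = nn.DataParallel(model, device_ids=device_ids)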
Thanks very much! This fixed my error.
CUDA_VISIBLE_DEVICES=0,3 effectively acts like an array: you put the physical card numbers there, in this case card 0 and card 3.
The GPUs config should then be set to [0,1], which means taking the cards at indices 0 and 1 of that array, i.e. CUDA_VISIBLE_DEVICES[0] and CUDA_VISIBLE_DEVICES[1], which are card 0 and card 3.
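A minimal sketch of that mapping (assuming a machine with at least four physical GPUs; the model is a placeholder):

# Launch with: CUDA_VISIBLE_DEVICES=0,3 python train.py
import torch
import torch.nn as nn

print(torch.cuda.device_count())                   # prints 2: cuda:0 is physical card 0, cuda:1 is physical card 3

model = nn.Linear(10, 2).cuda(0)                   # placeholder model on cuda:0 (physical card 0)
model = nn.DataParallel(model, device_ids=[0, 1])  # indices into the visible devices, i.e. physical cards 0 and 3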