I run into a CUDA out-of-memory error when I increase the input size for torch.svd on the GPU.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1503970438496/work/torch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
Traceback (most recent call last):
  File "svd-gpu.py", line 9, in <module>
    u, s, v = torch.svd(x, some=True);
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1503970438496/work/torch/lib/THC/generic/THCStorage.cu:66
torch.svd(x) works fine for a 3M × 300 matrix x but fails for 3.5M × 300.
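For reference, this is roughly what my svd-gpu.py does (a minimal sketch; I'm assuming float32, which is what torch.randn gives):

```python
import torch

# 3.5M x 300 random matrix on the GPU (float32).
# A 3M x 300 matrix works; 3.5M x 300 triggers the OOM above.
x = torch.randn(3500000, 300).cuda()

# Thin SVD: u is 3.5M x 300, s is 300, v is 300 x 300.
u, s, v = torch.svd(x, some=True)
```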
I have a total of 8 Tesla K80 GPUs, each with 12 GB of memory. Having access to all 8 GPUs doesn't seem to make a difference, so I ran the following experiment.
I set CUDA_VISIBLE_DEVICES to 0, and nvidia-smi shows 11392 MiB (out of 11439 MiB) consumed on GPU 0 and the program fails. This I understand.
I set CUDA_VISIBLE_DEVICES to 0,1, and nvidia-smi shows 8245 MiB (out of 11439 MiB) consumed on GPU 0 and only around 2 MiB consumed on GPU 1, yet the program still fails.
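In case it matters, this is how I restrict the visible devices (a sketch; equivalently the variable can be set in the shell before launching the script):

```python
import os

# Must be set before torch initializes CUDA for it to take effect.
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'

import torch
print(torch.cuda.device_count())  # prints 2 with the setting above
```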
Does having more GPUs or more memory have no effect on SVD? How does the computation happen? Regardless of how many GPUs are available, does PyTorch use only one GPU to compute the SVD?
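For context, here is my back-of-envelope estimate of the sizes involved (what the solver actually keeps resident is an assumption on my part, so this may well be off):

```python
# Rough memory estimate for the failing case, assuming float32 (4 bytes).
n, m = 3500000, 300

x_gb = n * m * 4 / 1e9   # input x: ~4.2 GB
u_gb = n * m * 4 / 1e9   # thin U has the same shape as x: ~4.2 GB

# If x and U must both sit on one device, that is ~8.4 GB before
# counting S, V, and the solver's workspace, which already crowds
# a 12 GB K80.
print(x_gb, u_gb)
```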