Matrix multiplication: code runs on my laptop but "CUDA out of memory" on GPU

I have some code that runs fine on my laptop (macOS, 2.3 GHz Intel Core i5, 16 GB memory), but fails on a GPU. On my laptop, I can run this fine:

>>> import torch
>>> x = torch.randn(70000, 16)
>>> y = torch.randn(16, 70000)
>>> z = torch.matmul(x, y)

But when I try to run this same code on a GPU, it fails:

>>> import torch
>>> device = torch.cuda.current_device()
>>> x = torch.randn(70000, 16).to(device)
>>> y = torch.randn(16, 70000).to(device)
>>> z = torch.matmul(x, y)
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524584710464/work/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1524584710464/work/aten/src/THC/generic/THCStorage.cu:58

Any idea why this happens, and is there any way I can get it to run on a GPU?

Since the result of your matrix multiplication will have the shape [70000, 70000] in torch.float32, it should take approximately 70000**2 * 4 / 1024**3 ≈ 18 GB. Your laptop is probably using its swap to get some additional memory. Could you check if that’s the case?
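
As a rough sketch, you can compare that estimate to your GPU’s total memory before running the matmul (device index 0 is just an assumption here):

>>> import torch
>>> x = torch.randn(70000, 16)
>>> # estimated size of the [70000, 70000] float32 result, in GB
>>> 70000 * 70000 * x.element_size() / 1024**3   # ~18.25
>>> # total memory of GPU 0, in GB, for comparison
>>> torch.cuda.get_device_properties(0).total_memory / 1024**3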


Thanks for the quick reply. I’ve never checked for swapping before, but I performed the matrix multiplication with Activity Monitor open and “swap used” went from 2 GB to 12 GB.

Is there any way to perform the above matrix multiplication on a GPU, or do I need to move everything to a CPU? (I also verified that I can do this operation on the head node CPU of my university’s cluster.)

I don’t know what you are actually calculating, but would it be possible to use some kind of batched approach or take a subset of these matrices?
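
For example, here is a rough sketch of a chunked version (the chunk size of 10000 rows is an arbitrary choice, and note that the concatenated result still needs ~18 GB of host RAM):

import torch

device = torch.cuda.current_device()
x = torch.randn(70000, 16).to(device)
y = torch.randn(16, 70000).to(device)

chunks = []
for xc in torch.split(x, 10000, dim=0):   # process 10000 rows at a time
    zc = torch.matmul(xc, y)              # [10000, 70000] chunk, ~2.6 GB on the GPU
    chunks.append(zc.cpu())               # move each chunk off the GPU right away
z = torch.cat(chunks, dim=0)              # full [70000, 70000] result on the CPU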

If you were using some flavour of Linux, you could easily monitor GPU memory allocations with the nvidia-smi command. macOS probably has something similar. For instance, on my PC I see the following:

$ nvidia-smi 
Tue Nov 20 00:13:02 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.66       Driver Version: 410.66       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:03:00.0  On |                  N/A |
|  0%   47C    P2    44W / 200W |   1418MiB /  8118MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0       804      G   /usr/lib/Xorg                                398MiB |
|    0      2462      G   ...uest-channel-token=17047571084676707762   141MiB |
|    0      5242      C   /usr/bin/python                              875MiB |
+-----------------------------------------------------------------------------+
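
You can also query the memory PyTorch itself has allocated from within Python (a quick sketch; this only reports memory held by PyTorch tensors, not the whole process):

>>> import torch
>>> device = torch.cuda.current_device()
>>> x = torch.randn(70000, 16).to(device)
>>> y = torch.randn(16, 70000).to(device)
>>> # current and peak memory held by tensors, in MB
>>> torch.cuda.memory_allocated(device) / 1024**2
>>> torch.cuda.max_memory_allocated(device) / 1024**2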

Sorry, I forgot to reply to this. Anyway, yes, I can decompose the computation. Will look into that. Thanks.

I’ve used this calculation a few times to estimate how many GB a tensor is, but I don’t actually understand it. Would you mind explaining the * 4 and / 1024**3 parts?

Sure!
I multiply the number of values by 4, since a float32 takes 4 bytes (4 * 8 = 32 bits). So if you are dealing with 70,000 float32 values, you would allocate 70,000 * 4 = 280,000 bytes. To convert bytes to GB, you have to divide by 1024 three times (bytes → KB → MB → GB), which is where the / 1024**3 comes from.
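
For the [70000, 70000] result from above, the same calculation step by step would look like this:

>>> n_bytes = 70000 * 70000 * 4   # number of elements * 4 bytes per float32
>>> n_bytes / 1024                # -> KB
>>> n_bytes / 1024**2             # -> MB
>>> n_bytes / 1024**3             # -> GB, roughly 18.25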
