I have some code that runs fine on my laptop (macOS, 2.3 GHz Intel Core i5, 16 GB memory) but fails on a GPU. On my laptop, this works without any problems:
>>> import torch
>>> x = torch.randn(70000, 16)
>>> y = torch.randn(16, 70000)
>>> z = torch.matmul(x, y)
But when I try to run this same code on a GPU, it fails:
>>> import torch
>>> device = torch.cuda.current_device()
>>> x = torch.randn(70000, 16).to(device)
>>> y = torch.randn(16, 70000).to(device)
>>> z = torch.matmul(x, y)
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1524584710464/work/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1524584710464/work/aten/src/THC/generic/THCStorage.cu:58
Any idea why, and is there any way I can get this to run on a GPU?
Since the result of your matrix multiplication will have the shape [70000, 70000] in torch.float32, it should take approximately 70000**2 * 4 / 1024**3 ≈ 18 GB. Your laptop is probably using its swap to get the additional memory. Could you check if that’s the case?
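If you want to double-check that number on the CPU result from your first snippet, here is a quick sketch using the standard tensor methods element_size() and nelement():
>>> z.element_size()    # bytes per float32 value
4
>>> z.nelement()        # 70000 * 70000 values
4900000000
>>> z.element_size() * z.nelement() / 1024**3    # ≈ 18.25 GiB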
Thanks for the quick reply. I’ve never checked for swapping before, but I performed the matrix multiplication with Activity Monitor open, and “swap used” went from 2 GB to 12 GB.
Is there any way to perform the above matrix multiplication on a GPU, or do I need to move everything to the CPU? (I also verified that I can do this operation on the CPU of my university cluster’s head node.)
If you were using some flavour of Linux, you could easily monitor GPU memory allocations with the nvidia-smi command. macOS probably has something similar. For instance, on my PC I see the following:
$ nvidia-smi
Tue Nov 20 00:13:02 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.66       Driver Version: 410.66       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1080    Off  | 00000000:03:00.0  On |                  N/A |
|  0%   47C    P2    44W / 200W |   1418MiB /  8118MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0       804      G   /usr/lib/Xorg                                398MiB |
|    0      2462      G   ...uest-channel-token=17047571084676707762   141MiB |
|    0      5242      C   /usr/bin/python                              875MiB |
+-----------------------------------------------------------------------------+
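Alternatively, you can query memory usage from PyTorch itself, which works regardless of the OS. A rough sketch using torch.cuda.memory_allocated and torch.cuda.get_device_properties:
>>> import torch
>>> torch.cuda.get_device_properties(0).total_memory / 1024**3   # total memory of GPU 0 in GiB
>>> torch.cuda.memory_allocated(0) / 1024**3                     # memory currently occupied by tensors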
I’ve used this calculation a few times to estimate how many GB a tensor is, but I don’t actually understand it. Would you mind explaining the * 4 and / 1024**3 parts?
Sure!
I multiply the number of values by 4, since each float32 takes 4 bytes (4 bytes * 8 bits per byte = 32 bits). So if you are dealing with 70,000 float32 values, they would take 70,000 * 4 = 280,000 bytes. To convert bytes to GB, you have to divide by 1024 three times (bytes to KB, KB to MB, MB to GB), which is where the / 1024**3 comes from.
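As a quick sanity check, here is the same arithmetic spelled out for the [70000, 70000] result from above:
>>> nbytes = 70000 * 70000 * 4    # number of values * 4 bytes per float32
>>> nbytes / 1024                 # KB
>>> nbytes / 1024**2              # MB
>>> round(nbytes / 1024**3, 2)    # GB: divided by 1024 three times
18.25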