Why does this small example run out of memory?

My GPU usage:

# nvidia-smi
Fri May 21 13:31:47 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:10.0 Off |                    0 |
| N/A   33C    P0    56W / 300W |  16126MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:11.0 Off |                    0 |
| N/A   33C    P0    57W / 300W |   1517MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:12.0 Off |                    0 |
| N/A   39C    P0    56W / 300W |   1519MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:13.0 Off |                    0 |
| N/A   55C    P0   278W / 300W |  15965MiB / 16130MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

x_data = torch.tensor([[1, 2], [3, 4]])
x_data.to("cuda:0") and x_data.to("cuda:3") caused:
RuntimeError: CUDA error: out of memory

'cuda:1' and 'cuda:2' work fine without a memory error.

Does PyTorch automatically make use of GPUs that currently have free memory? Also, the output above shows that GPU 0 is not used at all, so why did it also cause an out-of-memory error?

No, the GPU to be used is specified explicitly via its index, as in your code example.
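A minimal sketch of what that means in practice (the device index here is just an example from your setup):

import torch

x_data = torch.tensor([[1, 2], [3, 4]])

# The index in the device string selects the GPU explicitly;
# PyTorch will not fall back to another GPU if this one is full.
y = x_data.to("cuda:1")
print(y.device)  # cuda:1

# Equivalent way to address the same device by index:
device = torch.device("cuda", 1)
y = x_data.to(device)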

The output shows that GPU0 is almost completely filled, so I’m unsure what “not used at all” would mean in this context. Since this device has only ~4MB left, I would assume that a new allocation would raise an out of memory error.
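If you want to pick a device based on its free memory, you would have to do it manually. A minimal sketch, assuming a PyTorch version that provides torch.cuda.mem_get_info (otherwise the free memory could also be queried via NVML):

import torch

for i in range(torch.cuda.device_count()):
    # Returns (free, total) memory in bytes for the given device.
    free, total = torch.cuda.mem_get_info(i)
    print(f"cuda:{i}: {free / 1024**2:.0f} MiB free of {total / 1024**2:.0f} MiB")

# Pick the device with the most free memory manually; PyTorch won't do this for you.
# Note: this creates a CUDA context on each device, which itself needs some memory
# and could therefore fail on an almost full device.
best = max(range(torch.cuda.device_count()),
           key=lambda i: torch.cuda.mem_get_info(i)[0])
device = torch.device(f"cuda:{best}")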

@ptrblck When I say GPU 0 is 'not used at all', I am looking at the GPU-Util column, which shows 0% utilization. So why is the GPU at almost 0% utilization while at the same time its memory is used up?

Another question regarding the second table: why is the 'GPU Memory Usage' column blank? I thought it should show the actual GPU memory usage per process, which is different from the GPU utilization column in the first table.

Also, I found this little snippet to print the current GPU usage, and it shows the first two GPUs are not used:

In [1]: import nvidia_smi
   ...: 
   ...: nvidia_smi.nvmlInit()
   ...: 
   ...: deviceCount = nvidia_smi.nvmlDeviceGetCount()
   ...: for i in range(deviceCount):
   ...:     handle = nvidia_smi.nvmlDeviceGetHandleByIndex(i)
   ...:     info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
   ...:     print("Device {}: {}, Memory : ({:.2f}% free): {}(total), {} (free), {} (used)".format(i, nvidia_smi.nvmlDeviceGetName(handle), 100*info.free/info.total, info.total, info.free, info.used))
   ...: 
   ...: nvidia_smi.nvmlShutdown()
Device 0: b'Tesla V100-SXM2-16GB', Memory : (100.00% free): 16914055168(total), 16913989632 (free), 65536 (used)
Device 1: b'Tesla V100-SXM2-16GB', Memory : (100.00% free): 16914055168(total), 16913989632 (free), 65536 (used)
Device 2: b'Tesla V100-SXM2-16GB', Memory : (3.82% free): 16914055168(total), 646053888 (free), 16268001280 (used)
Device 3: b'Tesla V100-SXM2-16GB', Memory : (3.82% free): 16914055168(total), 646053888 (free), 16268001280 (used)

The GPU utilization gives, for a specific sample period, the percentage of time during which one or more GPU kernels were running on the device. If your script shows a low utilization, you could profile it and check where the bottlenecks are. Usually this indicates that the GPU is “starving”, i.e. your script cannot provide the data fast enough, which can happen e.g. when the data loading is slow compared to the model execution.
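As a rough illustration, a minimal profiling sketch (assuming a PyTorch version that ships torch.profiler; model, data, and device are placeholders):

import torch
from torch.profiler import profile, ProfilerActivity

device = torch.device("cuda:1")                 # placeholder: any free GPU
model = torch.nn.Linear(1024, 1024).to(device)  # placeholder model
data = torch.randn(64, 1024, device=device)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        out = model(data)
    torch.cuda.synchronize()

# If most of the time is spent in CPU-side work (e.g. data loading) rather than
# CUDA kernels, the utilization reported by nvidia-smi will stay low.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))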

The second table would show all processes using the device. If this information is empty, it could point towards permission issues on your system, so that nvidia-smi doesn’t get information about the running processes and thus cannot display the memory each one is using.
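If nvidia-smi cannot show the per-process memory, you could also try querying NVML directly. A sketch, assuming the pynvml package (the same bindings your nvidia_smi snippet is using):

import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    # Lists the compute processes NVML can see on this device.
    procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
    if not procs:
        print(f"Device {i}: no visible compute processes (or no permission to see them)")
    for p in procs:
        mem = "N/A" if p.usedGpuMemory is None else f"{p.usedGpuMemory / 1024**2:.0f} MiB"
        print(f"Device {i}: pid={p.pid}, memory={mem}")
pynvml.nvmlShutdown()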

The output of the script seems to match the output of nvidia-smi (it appears to query the same NVML interface, so this would be expected), but it seems that the device IDs changed this time.
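As a side note, the device indices PyTorch uses (cuda:0, cuda:1, ...) are not guaranteed to match the order in which nvidia-smi lists the GPUs, since nvidia-smi/NVML enumerates by PCI bus ID while the CUDA runtime may use a different order. If you want both orderings to line up, a common approach is to set CUDA_DEVICE_ORDER before CUDA is initialized, e.g.:

import os

# Make the CUDA runtime enumerate devices in PCI bus order,
# matching the order shown by nvidia-smi / NVML.
# Must be set before the first CUDA call in the process.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

import torch
print(torch.cuda.device_count())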