# nvidia-smi

```
Fri May 21 13:31:47 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:10.0 Off |                    0 |
| N/A   33C    P0    56W / 300W |  16126MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:11.0 Off |                    0 |
| N/A   33C    P0    57W / 300W |   1517MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:12.0 Off |                    0 |
| N/A   39C    P0    56W / 300W |   1519MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:13.0 Off |                    0 |
| N/A   55C    P0   278W / 300W |  15965MiB / 16130MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
```
```python
x_data = torch.tensor([[1, 2], [3, 4]])
x_data.to("cuda:0")  # RuntimeError: CUDA error: out of memory
x_data.to("cuda:3")  # RuntimeError: CUDA error: out of memory
```

`x_data.to("cuda:1")` and `x_data.to("cuda:2")` work fine without a memory error.
Does PyTorch automatically make use of GPUs that currently have free memory? Also, the output above shows that GPU 0 is not used at all, so why did it also cause an out-of-memory error?
No, the GPU to use is specified explicitly via its index, as in your code example.
The output shows that GPU0 is almost completely filled, so I’m unsure what “not used at all” would mean in this context. Since this device has only ~4MB left, I would assume that a new allocation would raise an out of memory error.
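To make the explicit device selection concrete, here is a minimal sketch. One detail worth noting (an addition, not something stated above): `Tensor.to()` is not in-place, it returns a new tensor, so the result has to be assigned back.

```python
import torch

# Pick the device explicitly; PyTorch does not choose a free GPU for you.
# Fall back to the CPU when CUDA is not available.
device = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")

x_data = torch.tensor([[1, 2], [3, 4]])
# .to() returns a copy on the target device; it does not move x_data in place.
x_data = x_data.to(device)
print(x_data.device)
```

With this pattern the same script runs on a CUDA-enabled machine and on a CPU-only box without code changes.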
@ptrblck When I say GPU 0 is "not used at all", I am looking at the GPU-Util column, which shows 0% utilization. Why is the GPU almost 0% utilized while at the same time its memory is used up?
Another question regarding the second table: why is its "GPU Memory Usage" column blank? I thought it should show the actual per-process GPU memory usage, which is different from the GPU-Util column in the first table.
The GPU utilization gives, for a specific time period, the percentage of time during which one or more GPU kernels were running on the device. If your script shows a low utilization, you could profile it and check where the bottlenecks are. Usually this indicates that the GPU is "starving", i.e. your script cannot provide the data fast enough, which might happen e.g. when data loading is slow compared to the model execution.
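A rough way to check for such starvation is to time the data-loading step separately from the compute step. The sketch below is illustrative only; `step_timings`, the loader, and `n_steps` are hypothetical names, not part of any PyTorch API:

```python
import time
import torch

def step_timings(loader, model, device, n_steps=20):
    """Hypothetical helper: compare time spent waiting for data
    vs. time spent in the forward/backward pass."""
    data_t, compute_t = 0.0, 0.0
    it = iter(loader)
    for _ in range(n_steps):
        t0 = time.perf_counter()
        x, _ = next(it)                    # time the data loading
        data_t += time.perf_counter() - t0

        t0 = time.perf_counter()
        loss = model(x.to(device)).float().mean()
        loss.backward()                    # time the compute
        if device.type == "cuda":
            # CUDA kernels launch asynchronously; wait before stopping the clock.
            torch.cuda.synchronize(device)
        compute_t += time.perf_counter() - t0
    return data_t, compute_t
```

If `data_t` dominates `compute_t`, the GPU is likely starving and things like more `DataLoader` workers or pinned memory are worth trying.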
The second table would show all processes using the device. If this information is empty, it could point towards permission issues on your system, so that nvidia-smi doesn't get information about the running processes and thus cannot display the memory each one is using.
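Even when nvidia-smi cannot show per-process memory, the current process can query what it has allocated through PyTorch's own allocator. A small sketch (the helper name `cuda_memory_report` is made up here; the underlying `torch.cuda.memory_allocated` / `memory_reserved` calls are real):

```python
import torch

def cuda_memory_report(device=0):
    """Return (allocated, reserved) bytes for this process on `device`,
    as tracked by PyTorch's caching allocator. (Hypothetical helper.)"""
    if not torch.cuda.is_available():
        return 0, 0
    return (torch.cuda.memory_allocated(device),
            torch.cuda.memory_reserved(device))
```

Note that `memory_reserved` includes cached blocks the allocator holds but has not handed out, so it can exceed `memory_allocated`.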
The output of the script seems to match the output of nvidia-smi (the script seems to be calling it, so this would be expected), but it appears that the device ids changed this time.
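One common reason for ids differing between tools (an aside, not something confirmed in this thread): nvidia-smi enumerates GPUs in PCI bus order, while the CUDA runtime defaults to a "fastest first" ordering. Setting the environment variable below before CUDA is initialized makes both use the same order; the visible-devices line is just an optional illustration:

```python
import os

# Must be set before the first CUDA call (ideally before importing torch),
# otherwise the runtime has already fixed its device ordering.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

# Optional: expose only a subset of GPUs; inside the process they are
# re-indexed as cuda:0, cuda:1, ...
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"
```

With `PCI_BUS_ID` ordering, `cuda:0` in PyTorch corresponds to GPU 0 in the nvidia-smi table above.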