No process is running (nvidia-smi), but RuntimeError says CUDA memory is full

Below is the result of the nvidia-smi command, which shows that no process is running. But when I try to run my code, it says:

RuntimeError: CUDA out of memory. Tried to allocate 1.02 GiB (GPU 3; 7.80 GiB total capacity; 6.24 GiB already allocated; 258.31 MiB free; 6.25 GiB reserved in total by PyTorch)

nvidia-smi results:

Thu Dec 23 11:52:33 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05    Driver Version: 455.23.05    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:05:00.0 Off |                  N/A |
| 30%   43C    P8    19W / 250W |      3MiB /  7981MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:06:00.0 Off |                  N/A |
| 27%   41C    P8     3W / 250W |      3MiB /  7982MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  Off  | 00000000:09:00.0 Off |                  N/A |
| 27%   39C    P8     2W / 250W |      3MiB /  7982MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  Off  | 00000000:0A:00.0 Off |                  N/A |
| 27%   35C    P8     3W / 250W |      3MiB /  7982MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Note:
1- I already tried torch.cuda.empty_cache() (a quick check of the allocator state is sketched right after this list).
2- This is a shared server and I have user access, not root access.
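
For reference, a minimal sketch of how to confirm what the caching allocator still holds after calling empty_cache() (assuming the default allocator and GPU 3; both counters are per-process):

    import torch

    torch.cuda.empty_cache()                # release cached, unused blocks back to the driver
    print(torch.cuda.memory_allocated(3))   # bytes currently held by live tensors on GPU 3
    print(torch.cuda.memory_reserved(3))    # bytes still reserved by PyTorch's caching allocator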

The error message explains that you are running out of memory and cannot allocate the desired ~1 GiB.
Your total GPU memory is given as 7.80 GiB, 6.25 GiB is already reserved by PyTorch, ~260 MiB is free, and the rest is used by the CUDA context (since you’ve made sure no other applications are running and using the device).
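
If you want to see that breakdown from the process itself, you can print the allocator statistics right after the OOM is raised (a minimal sketch; the CUDA context will not show up here, since it is not managed by PyTorch’s allocator):

    import torch

    # Detailed per-process report for GPU 3: allocated, reserved and
    # inactive (cached but unused) memory of the caching allocator.
    print(torch.cuda.memory_summary(device=3))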

What is PyTorch using this much memory for, while I am not running anything? I feel like PyTorch has a memory problem.

The CUDA context loads the driver and all linked CUDA kernels, i.e. PyTorch native kernels, cuDNN, NCCL, etc. If you don’t want to load these kernels (and either drop the performance or remove specific utils), you could rebuild PyTorch from source without any additional libraries (i.e. NCCL, cuDNN, MAGMA,…).
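
As a rough illustration of that context overhead (a sketch, assuming you can briefly use one of the idle GPUs): a single tiny CUDA tensor is enough to initialize the context, and the gap you then see in nvidia-smi is memory that PyTorch never reports as allocated:

    import torch

    x = torch.zeros(1, device="cuda:3")     # the first CUDA op initializes the context on GPU 3
    print(torch.cuda.memory_allocated(3))   # only a few hundred bytes of tensor memory
    # nvidia-smi will now show several hundred MiB used on GPU 3; the difference
    # is the CUDA context (driver + loaded kernels), not PyTorch tensors.
    input("Check nvidia-smi in another shell, then press Enter to exit")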

So this means that I cannot run code that requires the 1 GB allocation on this server (without changing the PyTorch source code)?

Yes, that’s correct. You are running out of memory and would need to reduce the memory usage by, e.g., lowering the batch size, using torch.utils.checkpoint, mixed-precision training, DistributedDataParallel, model sharding, etc.
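
For example, mixed-precision training with torch.cuda.amp typically cuts activation memory roughly in half. A self-contained sketch with a tiny stand-in model and random data (replace these with your own model, loss and data loader):

    import torch
    from torch.cuda.amp import autocast, GradScaler

    device = "cuda:3"
    model = torch.nn.Linear(128, 10).to(device)               # stand-in for your model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = torch.nn.CrossEntropyLoss()
    scaler = GradScaler()

    for _ in range(10):                                       # stand-in for your data loader
        inputs = torch.randn(32, 128, device=device)
        targets = torch.randint(0, 10, (32,), device=device)
        optimizer.zero_grad()
        with autocast():                                      # run the forward pass in float16 where safe
            loss = criterion(model(inputs), targets)
        scaler.scale(loss).backward()                         # scale the loss to avoid fp16 gradient underflow
        scaler.step(optimizer)                                 # unscale gradients and apply the step
        scaler.update()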


OK, thank you. I see.