No process is running (nvidia-smi), but RuntimeError says CUDA memory is full

Below is the result of the nvidia-smi command, which shows that no process is running. But when I try to run my code, it says:

RuntimeError: CUDA out of memory. Tried to allocate 1.02 GiB (GPU 3; 7.80 GiB total capacity; 6.24 GiB already allocated; 258.31 MiB free; 6.25 GiB reserved in total by PyTorch)

nvidia-smi results:

Thu Dec 23 11:52:33 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05    Driver Version: 455.23.05    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:05:00.0 Off |                  N/A |
| 30%   43C    P8    19W / 250W |      3MiB /  7981MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:06:00.0 Off |                  N/A |
| 27%   41C    P8     3W / 250W |      3MiB /  7982MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  Off  | 00000000:09:00.0 Off |                  N/A |
| 27%   39C    P8     2W / 250W |      3MiB /  7982MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  Off  | 00000000:0A:00.0 Off |                  N/A |
| 27%   35C    P8     3W / 250W |      3MiB /  7982MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Note:
1- I already tried torch.cuda.empty_cache() (a quick check of the allocator state is sketched right after this list).
2- This is a shared server and I have user access, not root access.
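
For reference, a minimal sketch of how to confirm what the caching allocator still holds after calling empty_cache() (assuming the default allocator and GPU 3; both counters are per-process):

    import torch

    torch.cuda.empty_cache()                # release cached, unused blocks back to the driver
    print(torch.cuda.memory_allocated(3))   # bytes currently held by live tensors on GPU 3
    print(torch.cuda.memory_reserved(3))    # bytes still reserved by PyTorch's caching allocator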

The error message explains that you are running out of memory and cannot allocate the desired ~1 GiB.
Your total GPU memory is given as 7.80 GiB, 6.25 GiB is already reserved by PyTorch, ~260 MiB is free, and the rest is used by the CUDA context (since you’ve made sure no other applications are running and using the device).
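
If you want to see that breakdown from the process itself, you can print the allocator statistics right after the OOM is raised (a minimal sketch; the CUDA context will not show up here, since it is not managed by PyTorch’s allocator):

    import torch

    # Detailed per-process report for GPU 3: allocated, reserved and
    # inactive (cached but unused) memory of the caching allocator.
    print(torch.cuda.memory_summary(device=3))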

What is PyTorch using this much memory for, while I am not running anything? I feel like PyTorch has a memory problem.

The CUDA context loads the driver and all linked CUDA kernels, i.e. PyTorch native kernels, cuDNN, NCCL, etc. If you don’t want to load these kernels (and either drop the performance or remove specific utils), you could rebuild PyTorch from source without any additional libraries (i.e. NCCL, cuDNN, MAGMA,…).
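
As a rough illustration of that context overhead (a sketch, assuming you can briefly use one of the idle GPUs): a single tiny CUDA tensor is enough to initialize the context, and the gap you then see in nvidia-smi is memory that PyTorch never reports as allocated:

    import torch

    x = torch.zeros(1, device="cuda:3")     # the first CUDA op initializes the context on GPU 3
    print(torch.cuda.memory_allocated(3))   # only a few hundred bytes of tensor memory
    # nvidia-smi will now show several hundred MiB used on GPU 3; the difference
    # is the CUDA context (driver + loaded kernels), not PyTorch tensors.
    input("Check nvidia-smi in another shell, then press Enter to exit")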

So this means that I cannot run code that requires the 1 GB allocation on this server (without changing the PyTorch source code)?

Yes, that’s correct. You are running out of memory and would need to reduce the memory usage by, e.g., lowering the batch size, using torch.utils.checkpoint, mixed-precision training, DistributedDataParallel, model sharding, etc.
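
For example, mixed-precision training with torch.cuda.amp typically cuts activation memory roughly in half. A self-contained sketch with a tiny stand-in model and random data (replace these with your own model, loss and data loader):

    import torch
    from torch.cuda.amp import autocast, GradScaler

    device = "cuda:3"
    model = torch.nn.Linear(128, 10).to(device)               # stand-in for your model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    criterion = torch.nn.CrossEntropyLoss()
    scaler = GradScaler()

    for _ in range(10):                                       # stand-in for your data loader
        inputs = torch.randn(32, 128, device=device)
        targets = torch.randint(0, 10, (32,), device=device)
        optimizer.zero_grad()
        with autocast():                                      # run the forward pass in float16 where safe
            loss = criterion(model(inputs), targets)
        scaler.scale(loss).backward()                         # scale the loss to avoid fp16 gradient underflow
        scaler.step(optimizer)                                 # unscale gradients and apply the step
        scaler.update()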


OK, thank you. I see.