PyTorch num_workers>0 code worked the first time and then never worked with the same settings again

I was trying to run this code on my local machine with an RTX 3060 Ti: Pytorch Efficientnet Baseline [Train] AMP+Aug | Kaggle

With batch size 16 I can run VoVNet on my local machine. The problem I am facing is very strange: the code worked fine, but after 2-3 epochs it became very, very slow and I saw in Spyder that memory consumption was increasing. I restarted Spyder, removed variables, even shut down my PC several times and reduced num_workers, but that same code doesn't work anymore. Now, to run that code, I need to set batch size = 2 instead of 16, and if I don't, the same code that worked a few minutes ago shows me this error:

runfile('C:/Users/Mobassir/.spyder-py3/temp.py', wdir='C:/Users/Mobassir/.spyder-py3')
Using device: cuda
Training with 1 started
17117 4280
model loaded
  0%|          | 0/1070 [00:04<?, ?it/s]
Traceback (most recent call last):

  File "C:\Users\Mobassir\.spyder-py3\temp.py", line 1161, in <module>
    train_one_epoch(epoch, model, optimizer, train_loader, device, scheduler=scheduler, schd_batch_update=False)

  File "C:\Users\Mobassir\.spyder-py3\temp.py", line 1004, in train_one_epoch
    scaler.scale(loss).backward()

  File "C:\Users\Mobassir\anaconda3\envs\kaggle\lib\site-packages\torch\tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)

  File "C:\Users\Mobassir\anaconda3\envs\kaggle\lib\site-packages\torch\autograd\__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(

RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 8.00 GiB total capacity; 1.48 GiB already allocated; 0 bytes free; 2.90 GiB reserved in total by PyTorch)

Out of frustration I uninstalled CUDA and cuDNN and reinstalled them, but it happens again: on the first run I get no error, after 2-3 epochs training becomes super slow, and then if I restart and retry, the same code won't work anymore. I can't understand how to solve this issue. Is it related to autograd or something else?

@ptrblck please help?
I tried num_workers = 4 and pin_memory = True (roughly the loader setup sketched below).
Is that causing the issue?
How can I fix this? Everything else is exactly the same as this kernel: Pytorch Efficientnet Baseline [Train] AMP+Aug | Kaggle
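
For reference, this is roughly how my DataLoader is configured (the real dataset and transforms come from the linked kernel; the dummy TensorDataset here is just a stand-in so the snippet runs on its own):

import torch
from torch.utils.data import DataLoader, TensorDataset

# dummy stand-in for the dataset built in the linked kernel
train_dataset = TensorDataset(torch.randn(64, 3, 224, 224),
                              torch.zeros(64, dtype=torch.long))

train_loader = DataLoader(
    train_dataset,
    batch_size=16,     # now has to be dropped to 2 to avoid the OOM
    shuffle=True,
    num_workers=4,     # any value > 0 makes memory usage grow over the epochs
    pin_memory=True,
)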

My questions:

  1. Why does the code that worked 5 minutes ago no longer work with the exact same settings (even after restarting the computer)?
  2. Even with half the batch size I get that CUDA OOM error.
  3. How can I fix this?

Sometimes I get this error:

Traceback (most recent call last):

  File "C:\Users\Mobassir\.spyder-py3\temp.py", line 1161, in <module>
    train_one_epoch(epoch, model, optimizer, train_loader, device, scheduler=scheduler, schd_batch_update=False)

  File "C:\Users\Mobassir\.spyder-py3\temp.py", line 1009, in train_one_epoch
    running_loss = running_loss * .99 + loss.item() * .01

RuntimeError: CUDA error: an illegal memory access was encountered

when I re-run…

About my environment:

I used an Anaconda virtual environment with:

  • Python 3.8.5
  • PyTorch 1.7.0
  • CUDA 11.0
  • cuDNN 8004
  • GPU: RTX 3060 Ti
  • Is CUDA available: Yes
  • Spyder
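
A quick way to confirm these values from inside the environment (just the standard PyTorch version queries; the comments show what I expect them to print on my setup):

import torch

print(torch.__version__)               # 1.7.0
print(torch.version.cuda)              # 11.0
print(torch.backends.cudnn.version())  # 8004
print(torch.cuda.is_available())       # True
print(torch.cuda.get_device_name(0))   # GeForce RTX 3060 Ti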

This error message:

RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 8.00 GiB total capacity; 1.48 GiB already allocated; 0 bytes free; 2.90 GiB reserved in total by PyTorch)

suggests that other processes are using the GPU and are allocating memory.
You can check it via nvidia-smi and see how much memory is used before running your script.
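
Besides nvidia-smi, you can also check what PyTorch itself has allocated vs. reserved, which corresponds to the "already allocated" and "reserved in total by PyTorch" numbers in the OOM message (standard torch.cuda memory stats; the device index assumes a single-GPU setup):

import torch

device = torch.device('cuda:0')
print(torch.cuda.memory_allocated(device) / 1024**2, 'MiB allocated by tensors')
print(torch.cuda.memory_reserved(device) / 1024**2, 'MiB reserved by the caching allocator')
print(torch.cuda.memory_summary(device))  # detailed breakdown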

@ptrblck

  1. Why would it use my GPU memory even after a computer restart?
  2. If I set num_workers = 0 it works, but very slowly; with any non-zero value, memory consumption gradually increases and I get OOM.
  3. I restored the shader cache setting from the NVIDIA Control Panel, but the same issue persists.

I used this computer 15-17 hours ago; now I'm using it again and facing the same problem. Now even with num_workers = 0 my code is not working and gives OOM. Here is my nvidia-smi output:

Sun Jan 31 19:04:53 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 457.51       Driver Version: 457.51       CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 306... WDDM  | 00000000:09:00.0  On |                  N/A |
| 59%   50C    P2   112W / 240W |   8033MiB /  8192MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1320    C+G   ...5n1h2txyewy\SearchApp.exe    N/A      |
|    0   N/A  N/A      3236    C+G   ...lication\WebCompanion.exe    N/A      |
|    0   N/A  N/A      5468    C+G   ...ekyb3d8bbwe\YourPhone.exe    N/A      |
|    0   N/A  N/A      6264      C   ...a3\envs\kaggle\python.exe    N/A      |
|    0   N/A  N/A      7136      C   Insufficient Permissions        N/A      |
|    0   N/A  N/A      7908    C+G   ...me\Application\chrome.exe    N/A      |
|    0   N/A  N/A      8560    C+G   ...5n1h2txyewy\SearchApp.exe    N/A      |
|    0   N/A  N/A     10632    C+G   C:\Windows\explorer.exe         N/A      |
|    0   N/A  N/A     11108    C+G   Insufficient Permissions        N/A      |
|    0   N/A  N/A     11744    C+G   ...artMenuExperienceHost.exe    N/A      |
|    0   N/A  N/A     11984    C+G   ...nputApp\TextInputHost.exe    N/A      |
|    0   N/A  N/A     13232    C+G   ...cw5n1h2txyewy\LockApp.exe    N/A      |
+-----------------------------------------------------------------------------+

There seem to be multiple issues based on your description:

  • If you are seeing a memory increase while running the script, this points to the common issue of storing tensors that are still attached to the computation graph, which prevents PyTorch from freeing them. Often this is done by e.g. appending the output or loss tensors to a list without detaching them (see the sketch after this list).
  • If you are seeing the same error message right after restarting the machine, your OS might be using GPU memory, so not the complete GPU is available to PyTorch (if you have a monitor plugged into the GPU, it will use memory for the display output).
  • If you are seeing a clean GPU at the beginning and an increased memory usage only when using a DataLoader with multiple workers, please post a code snippet to reproduce the issue.
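
As a minimal sketch of the first point (a hypothetical toy loop, not your actual code): storing loss directly keeps every iteration's graph alive, while storing loss.item() (or loss.detach()) lets it be freed:

import torch

model = torch.nn.Linear(10, 1).cuda()
criterion = torch.nn.MSELoss()
losses = []

for _ in range(100):
    output = model(torch.randn(16, 10, device='cuda'))
    loss = criterion(output, torch.randn(16, 1, device='cuda'))

    # losses.append(loss)        # BAD: keeps the computation graph of every iteration alive
    losses.append(loss.item())   # GOOD: stores a plain Python float, so the graph can be freed

The same applies to storing model outputs or predictions for logging or metric calculation.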