I was trying to run this code on my local machine (RTX 3060 Ti): Pytorch Efficientnet Baseline [Train] AMP+Aug | Kaggle
With batch size 16 I can run VoVNet on my local machine. The problem I am facing is very strange: the code worked fine, but after 2-3 epochs it became very, very slow, and I saw in Spyder that memory consumption kept increasing. I restarted Spyder, removed variables, reduced num_workers, and even shut down my PC several times, but the same code doesn't work anymore. Now, to run it, I need to set batch size = 2 instead of 16, and if I don't, the same code that worked a few minutes ago shows me this error:
runfile('C:/Users/Mobassir/.spyder-py3/temp.py', wdir='C:/Users/Mobassir/.spyder-py3')
Using device: cuda
Training with 1 started
17117 4280
model loaded
  0%|          | 0/1070 [00:04<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\Mobassir\.spyder-py3\temp.py", line 1161, in <module>
    train_one_epoch(epoch, model, optimizer, train_loader, device, scheduler=scheduler, schd_batch_update=False)
  File "C:\Users\Mobassir\.spyder-py3\temp.py", line 1004, in train_one_epoch
    scaler.scale(loss).backward()
  File "C:\Users\Mobassir\anaconda3\envs\kaggle\lib\site-packages\torch\tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "C:\Users\Mobassir\anaconda3\envs\kaggle\lib\site-packages\torch\autograd\__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 8.00 GiB total capacity; 1.48 GiB already allocated; 0 bytes free; 2.90 GiB reserved in total by PyTorch)
Out of frustration I uninstalled CUDA and cuDNN and reinstalled them, but it happens again: on the first run I get no error, after 2-3 epochs training becomes super slow, and if I restart and retry, the same code won't work anymore. I can't understand how to solve this issue. Is it related to autograd, or something else?
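To confirm whether GPU memory really grows across epochs, I could log something like the helper below once per epoch (this is my own debugging sketch, not part of the Kaggle kernel; the function name is made up):

```python
import torch

def gpu_mem_summary(device=0):
    """Report allocated vs. reserved CUDA memory in MiB for one GPU."""
    if not torch.cuda.is_available():
        return "CUDA not available"
    alloc = torch.cuda.memory_allocated(device) / 2**20
    reserved = torch.cuda.memory_reserved(device) / 2**20
    return f"allocated: {alloc:.1f} MiB | reserved: {reserved:.1f} MiB"

# e.g. print(gpu_mem_summary()) at the end of each train_one_epoch call
```

If the allocated number climbs every epoch, something in the training loop is holding references to CUDA tensors.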
@ptrblck please help?
I tried using num_workers = 4 and pin_memory = 1.
Could that be causing the issue?
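For context, the loader settings I mean look roughly like this (the dataset here is a dummy stand-in, not the kernel's actual dataset):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for the kernel's image dataset
dataset = TensorDataset(
    torch.randn(64, 3, 32, 32),
    torch.zeros(64, dtype=torch.long),
)

train_loader = DataLoader(
    dataset,
    batch_size=16,
    shuffle=True,
    num_workers=4,    # the value I tried
    pin_memory=True,  # pin_memory expects a bool; I passed 1, which is truthy
)
```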
How can I fix this? Everything else is exactly the same as in this kernel: Pytorch Efficientnet Baseline [Train] AMP+Aug | Kaggle
My questions:
- Why does the code that worked 5 minutes ago no longer work with the exact same settings, even after restarting the computer?
- Why do I get that CUDA OOM error even with half the batch size?
- How can I fix this?
Sometimes, when I re-run, I get this error instead:

Traceback (most recent call last):
  File "C:\Users\Mobassir\.spyder-py3\temp.py", line 1161, in <module>
    train_one_epoch(epoch, model, optimizer, train_loader, device, scheduler=scheduler, schd_batch_update=False)
  File "C:\Users\Mobassir\.spyder-py3\temp.py", line 1009, in train_one_epoch
    running_loss = running_loss * .99 + loss.item() * .01
RuntimeError: CUDA error: an illegal memory access was encountered
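From what I understand (my assumption, not something the kernel documents), CUDA calls run asynchronously, so the Python line in a traceback like the one above is often not where the fault actually happened. To get a traceback pointing at the real call site, I could force synchronous launches before starting the script:

```python
import os

# Must be set before CUDA is initialized (i.e. before the first CUDA call),
# so ideally before importing torch / launching the training script.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```

This slows training down, so it is only meant for debugging runs.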
About my environment: I used an Anaconda virtual environment with
- Python 3.8.5
- PyTorch 1.7.0
- CUDA 11.0
- cuDNN 8004
- GPU: RTX 3060 Ti
- Is CUDA available: Yes