RuntimeError: CUDA out of memory

This error shows up every time I try to train the model; it happens in the very first epoch.
Is there a way to fix this?
Could someone help me with it?

Traceback (most recent call last):
  File "/home/rahul/Dev/Image-Segmentation-using-PyTorch-and-OpenCV/src/train.py", line 76, in <module>
    main()
  File "/home/rahul/Dev/Image-Segmentation-using-PyTorch-and-OpenCV/src/train.py", line 67, in main
    train_fn(train_loader, model, optimizer, loss_fn)
  File "/home/rahul/Dev/Image-Segmentation-using-PyTorch-and-OpenCV/src/train.py", line 40, in train_fn
    preds = model(data)
  File "/home/rahul/Dev_Tools/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/rahul/Dev/Image-Segmentation-using-PyTorch-and-OpenCV/src/model/model.py", line 39, in forward
    x = self.pool(x)
  File "/home/rahul/Dev_Tools/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/rahul/Dev_Tools/anaconda3/lib/python3.8/site-packages/torch/nn/modules/pooling.py", line 162, in forward
    return F.max_pool2d(input, self.kernel_size, self.stride,
  File "/home/rahul/Dev_Tools/anaconda3/lib/python3.8/site-packages/torch/_jit_internal.py", line 365, in fn
    return if_false(*args, **kwargs)
  File "/home/rahul/Dev_Tools/anaconda3/lib/python3.8/site-packages/torch/nn/functional.py", line 659, in _max_pool2d
    return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 3.82 GiB total capacity; 2.15 GiB already allocated; 74.50 MiB free; 2.16 GiB reserved in total by PyTorch)

These are my GPU details; I am using a GTX 1650 Ti with 4 GB of VRAM:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 470.42.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
| N/A   43C    P8     4W /  N/A |    556MiB /  3911MiB |     12%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1028      G   /usr/lib/xorg/Xorg                272MiB |
|    0   N/A  N/A      1368      G   /usr/bin/gnome-shell              131MiB |
|    0   N/A  N/A      2136      G   ...gAAAAAAAAA --shared-files       32MiB |
|    0   N/A  N/A     23039      G   ...AAAAAAAAA= --shared-files       25MiB |
|    0   N/A  N/A     25760      G   ...AAAAAAAAA= --shared-files       42MiB |
|    0   N/A  N/A     25773      G   ...AAAAAAAAA= --shared-files       47MiB |
+-----------------------------------------------------------------------------+

You would need to reduce the memory requirement of your training, e.g. by reducing the batch size or by using torch.utils.checkpoint to trade compute for memory.
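
For reference, here is a minimal sketch of the idea (a generic double-conv block, not your actual model): the checkpointed function does not store its intermediate activations during the forward pass and recomputes them during the backward pass, so you pay extra compute for a lower peak memory usage.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class ConvBlock(nn.Module):
    # Hypothetical double-conv block, as found in many segmentation models.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # Recompute this block's activations during backward instead of
        # caching them, trading compute for memory.
        return checkpoint(self.block, x)

Note that at least one input to the checkpointed function needs requires_grad=True (usually the case once the input has passed through an earlier trainable layer); otherwise no gradients flow through the recomputed segment and PyTorch emits a warning.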

@ptrblck thank you for replying.
I have reduced the batch size significantly, from 32 down to 4, but it's still not running.
Could you recommend a tutorial for torch.utils.checkpoint?

I was using this tutorial in the past; it might be a bit old by now, but it could still give you a good idea of how to use it.
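
In case it helps as a quick reference (a minimal sketch, not taken from that tutorial): for a model expressed as an nn.Sequential, there is also torch.utils.checkpoint.checkpoint_sequential, which splits the stack into segments and only stores activations at the segment boundaries.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Hypothetical stand-in for a sequential feature extractor.
blocks = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
)
x = torch.randn(1, 3, 256, 256, requires_grad=True)
# Two segments: activations are kept only at the segment boundary and
# recomputed inside each segment during the backward pass.
out = checkpoint_sequential(blocks, 2, x)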

I will try it soon, but it's working fine when I don't use batches at all.
Will there be any downside to doing it this way?

By “don’t use batches at all” I assume you mean a batch size of 1?
If so, you should be careful with layers that depend on the batch size, such as batchnorm layers (which compute the mean and stddev from the input batch), as they might create noisy running stats.
Also, the convergence of the model would change, and you might generally see a slower epoch time.
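
If a batch size of 1 turns out to be problematic, one common workaround (a sketch reusing the names from your train.py, with hypothetical targets and accum_steps) is gradient accumulation: run several size-1 micro-batches, accumulate their gradients, and call optimizer.step() once, so the gradient reflects a larger effective batch while peak memory stays at the size-1 level.

accum_steps = 8  # hypothetical; effective batch size = 1 * accum_steps
optimizer.zero_grad()
for i, (data, targets) in enumerate(train_loader):
    data, targets = data.cuda(), targets.cuda()
    # Average the loss so the accumulated gradient matches a single
    # optimizer step at batch size accum_steps.
    loss = loss_fn(model(data), targets) / accum_steps
    loss.backward()  # gradients accumulate in the parameters' .grad buffers
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

Note that batchnorm layers would still only see each size-1 micro-batch, so this does not fix their noisy running stats; replacing nn.BatchNorm2d with nn.GroupNorm is sometimes combined with this approach.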