I’m studying how to modify the SSD model, using the implementation here as a reference.
What puzzles me is that a problem occurs during the forward computation of the VGG layers.
train.py
...
for iteration in range(args.start_iter, cfg['max_iter']):
    ...
    out = net(images)
    ...
ssd.py
def forward(self, x):
    for k in range(23):
        x = self.vgg[k](x).detach()
        print(torch.cuda.memory_allocated() / 1024**2)
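The print above only shows the memory held by live tensors after each layer has returned, so temporary allocations made inside a layer call and freed before the print would not appear in it. A minimal sketch of a more detailed report, assuming the hypothetical helper names `to_mib` and `cuda_memory_report` (neither is in the original code), could look like this:

```python
try:
    import torch
except ImportError:  # allows the pure-Python helper to be used without PyTorch
    torch = None


def to_mib(num_bytes):
    """Convert a byte count to MiB, as the print in forward() does."""
    return num_bytes / 1024 ** 2


def cuda_memory_report(tag):
    """Return one line with current, reserved and peak CUDA memory (hypothetical helper)."""
    if torch is None or not torch.cuda.is_available():
        return f"{tag}: CUDA not available"
    allocated = to_mib(torch.cuda.memory_allocated())    # live tensors right now
    reserved = to_mib(torch.cuda.memory_reserved())      # held by PyTorch's caching allocator
    peak = to_mib(torch.cuda.max_memory_allocated())     # highest allocated value so far
    return (f"{tag}: allocated={allocated:.1f} MiB "
            f"reserved={reserved:.1f} MiB peak={peak:.1f} MiB")


print(cuda_memory_report("after vgg[0]"))
```

Calling `cuda_memory_report(f"after vgg[{k}]")` inside the loop would show whether the peak is much higher than the value printed after the layer returns.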
- When batch_size is set to 16, training runs normally.
Below is what it prints:
461.8271484375
461.8271484375
461.3896484375
461.3896484375
197.7177734375
285.6083984375
285.6083984375
285.6083984375
285.6083984375
153.7724609375
197.7177734375
197.7177734375
197.7177734375
197.7177734375
197.7177734375
197.7177734375
132.3896484375
154.9521484375
154.9521484375
154.9521484375
154.9521484375
154.9521484375
154.9521484375
- When batch_size is set to 32, CUDA runs out of memory.
Below is what it prints:
830.306640625
830.306640625
Traceback (most recent call last):
…
RuntimeError: CUDA out of memory. Tried to allocate 3.09 GiB (GPU 0; 8.00 GiB total capacity; 1.50 GiB already allocated; 3.52 GiB free; 2.44 GiB reserved in total by PyTorch)
- Sometimes it even runs out of memory when batch_size is set to 16,
but then it runs normally after a reboot.
Below is what it prints when the OOM happens:
461.8349609375
461.8349609375
461.3974609375
461.3974609375
197.7255859375
285.6162109375
285.6162109375
285.6162109375
285.6162109375
153.7802734375
197.7255859375
197.7255859375
197.7255859375
197.7255859375
197.7255859375
197.7255859375
132.3974609375
Traceback (most recent call last):
…
RuntimeError: CUDA out of memory. Tried to allocate 4.26 GiB (GPU 0; 8.00 GiB total capacity; 177.52 MiB already allocated; 5.51 GiB free; 474.00 MiB reserved in total by PyTorch)
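As a rough sanity check on the first printed values, the size of the tensors involved can be estimated by hand. This is only a sketch under stated assumptions: 300x300 RGB inputs (as in SSD300), float32 data, and a first VGG layer that outputs 64 channels at the same resolution. The helper name `tensor_mib` is hypothetical.

```python
def tensor_mib(batch, channels, height, width, bytes_per_elem=4):
    """Size of a dense tensor in MiB (float32 by default)."""
    return batch * channels * height * width * bytes_per_elem / 1024 ** 2


# First value printed in the batch_size=16 and batch_size=32 runs above.
printed_after_first_layer = {16: 461.8271484375, 32: 830.306640625}

for batch, printed in printed_after_first_layer.items():
    images = tensor_mib(batch, 3, 300, 300)    # assumed input size
    conv1 = tensor_mib(batch, 64, 300, 300)    # assumed first conv output
    remainder = printed - images - conv1       # whatever else is allocated
    print(f"batch={batch}: images={images:.2f} MiB, conv1={conv1:.2f} MiB, "
          f"remainder={remainder:.2f} MiB")
```

Under these assumptions the first conv output alone is about 352 MiB at batch_size 16 and about 703 MiB at batch_size 32, and the remainder is about 94 MiB in both runs, which would be consistent with a fixed cost such as the model parameters (an assumption, not something measured here). The estimate does not account for the 3.09 GiB and 4.26 GiB requests in the error messages.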
The questions I want to ask:
- Why does the computation in the VGG layers cause a CUDA OOM? This is NOT the step where the images are moved to the GPU.
- Why can the loop run twice (it prints twice) before failing when batch_size is set to 32?
- Why does the OOM happen somewhat randomly? Is something left on the GPU from the previous run (something that can only be cleared by a reboot)?
- How can I solve this problem?
OS: Windows 10
Python: 3.6.8
PyTorch: 1.5.1
nvcc: 10.2
GPU: NVIDIA GeForce RTX 2070 SUPER (8192MB)