I am encountering the following error during my training run:
RuntimeError: CUDA out of memory. Tried to allocate 98.00 MiB (GPU 0; 11.93 GiB total capacity; 11.32 GiB already allocated; 81.06 MiB free; 72.23 MiB cached)
I have tried the following approaches to solve the issue, all to no avail:

- reducing the batch size, all the way down to 1
- moving everything except the network itself to the CPU (see the sketch after this list)
- removing the validation code and executing only the training code
- reducing the size of the network (I reduced it significantly; details below)
- scaling the magnitude of the loss being backpropagated down to a much smaller value
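For reference, here is a simplified sketch of my training step with the device placement from the second bullet. Apart from `corr_net`, `inputs`, and `loss` (which appear in the traceback below), every name is a placeholder for objects in my actual code, and the real loss computation is elided:

```python
import torch

device = torch.device("cuda:0")
corr_net = corr_net.to(device)            # only the network lives on the GPU
running_loss = 0.0

for batch_idx, (inputs, targets) in enumerate(train_loader):   # train_loader, optimizer,
    inputs = inputs.to(device)                                   # and criterion come from
    targets = targets.to(device)                                 # my actual code

    optimizer.zero_grad()
    outputs = corr_net(inputs)            # forward pass shown in the traceback
    loss = criterion(outputs, targets)    # placeholder for my actual loss computation
    loss.backward()                       # the OOM is raised here
    optimizer.step()

    # anything kept for logging is detached and moved back to the CPU
    running_loss += loss.detach().cpu().item()
```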
None of the above has worked. The crash happens just a few batches into the very first epoch: with a larger batch size (say 16) it crashes after fewer batches, and with a smaller batch size it survives a few more batches before crashing. Typically it crashes around 5 to 8 batches into the first epoch.
The following is the full traceback:
sys:1: RuntimeWarning: Traceback of forward call that caused the error:
File "main.py", line 411, in <module>
intensity.cnn_mse(au_net, config)
File "/work/satrajic/FacialExpression/Experiments/intensity.py", line 1431, in cnn_mse
corr_factor = corr_net(inputs).reshape([BATCH_SIZE, 2, 5]).cpu()
File "/home/satrajic/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/work/satrajic/FacialExpression/Experiments/base_model.py", line 701, in forward
x = self.features(x)
File "/home/satrajic/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/satrajic/anaconda3/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/home/satrajic/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/satrajic/anaconda3/lib/python3.7/site-packages/torch/nn/modules/pooling.py", line 146, in forward
self.return_indices)
File "/home/satrajic/anaconda3/lib/python3.7/site-packages/torch/_jit_internal.py", line 133, in fn
return if_false(*args, **kwargs)
File "/home/satrajic/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 494, in _max_pool2d
input, kernel_size, stride, padding, dilation, ceil_mode)
Traceback (most recent call last):
File "main.py", line 411, in <module>
intensity.cnn_mse(au_net, config)
File "/work/satrajic/FacialExpression/Experiments/intensity.py", line 1474, in cnn_mse
loss.backward()
File "/home/satrajic/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 107, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/satrajic/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 93, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 98.00 MiB (GPU 0; 11.93 GiB total capacity; 11.32 GiB already allocated; 80.06 MiB free; 72.23 MiB cached)
It seems like the issue is happening during the backward pass, while the gradients are being stored. Something seems to be ballooning in memory, but I cannot figure out what exactly is causing it, or how I should fix it.
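If it would help, I can add per-batch logging of the allocator state along these lines (assuming `torch.cuda.memory_allocated` and `torch.cuda.memory_cached` report what I think they do in my torch version) and post the numbers:

```python
import torch

def log_gpu_memory(batch_idx):
    # MiB currently held by live tensors vs. MiB held by the caching allocator
    allocated = torch.cuda.memory_allocated() / 1024 ** 2
    cached = torch.cuda.memory_cached() / 1024 ** 2
    print(f"batch {batch_idx}: allocated={allocated:.1f} MiB, cached={cached:.1f} MiB")

# called once per batch from the training loop to see whether usage grows over time
```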
Something of note is that the loss values are very large. I tried manually scaling them down by several orders of magnitude, but that did not work either.
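The scaling was nothing more sophisticated than dividing the loss by a constant before calling backward, roughly like this (the constant is only an illustrative value):

```python
# rough sketch of the loss-scaling attempt
loss = criterion(outputs, targets)
scaled_loss = loss / 1e4      # shrink the magnitude before backpropagation
scaled_loss.backward()
```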
I would really appreciate any help with this, as I am completely perplexed.