CUDA runs out of memory after some epochs

Chame_call · March 7, 2020, 6:36pm

I’m using a third-party project to train an age and gender prediction model.

That’s the problem script.
The problem is that the memory occupied by the script increases with every epoch.
I observe the memory usage via nvidia-smi.
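
To put numbers on the per-epoch growth instead of reading nvidia-smi by eye, something like the following could be called at the end of each epoch. This is only a minimal sketch: the helper names (`parse_gpu_memory`, `query_gpu_memory`, `format_epoch_report`) are made up for illustration and are not part of the third-party project, and it assumes nvidia-smi is on the PATH and reports plain integer MiB values.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' lines (in MiB) from nvidia-smi CSV output.

    Returns one (used_mib, total_mib) tuple per GPU.
    """
    readings = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        readings.append((used, total))
    return readings


def query_gpu_memory():
    """Ask nvidia-smi for used and total memory of every visible GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,  # decode stdout as text (works on Python 3.6)
    )
    return parse_gpu_memory(result.stdout)


def format_epoch_report(epoch, readings):
    """Build one log line per epoch, e.g. for printing after each epoch."""
    parts = [
        "GPU {}: {} / {} MiB".format(index, used, total)
        for index, (used, total) in enumerate(readings)
    ]
    return "epoch {} | {}".format(epoch, " | ".join(parts))
```

Calling `print(format_epoch_report(epoch, query_gpu_memory()))` at the end of every epoch would show whether the used memory climbs steadily from one epoch to the next.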

I can decrease my batch size, and then the training can last 2 epochs, but after that CUDA runs out of memory again.
I suppose memory is being allocated in the wrong place, or something like that.
Maybe someone can spot the problem at a quick look.
Any help would be appreciated.

Chame_call · March 7, 2020, 6:37pm

That’s my error traceback:

  File "train.py", line 641, in <module>
    a.train_model()
  File "train.py", line 294, in train_model
    gender_out, age_out = self.model(inputs)
  File "/home/algernone/anaconda3/envs/scienv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/algernone/DNNS/age-gender-pytorch/Age-Gender-Pred/agegenpredmodel.py", line 140, in forward
    last1 = self.get_resnet_convs_out(x)
  File "/home/algernone/DNNS/age-gender-pytorch/Age-Gender-Pred/agegenpredmodel.py", line 83, in get_resnet_convs_out
    x = self.resNet.layer1(x)   # out = [N, 64, 56, 56]
  File "/home/algernone/anaconda3/envs/scienv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/algernone/anaconda3/envs/scienv/lib/python3.6/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/algernone/anaconda3/envs/scienv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/algernone/anaconda3/envs/scienv/lib/python3.6/site-packages/torchvision/models/resnet.py", line 65, in forward
    out = self.bn2(out)
  File "/home/algernone/anaconda3/envs/scienv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/algernone/anaconda3/envs/scienv/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 107, in forward
    exponential_average_factor, self.eps)
  File "/home/algernone/anaconda3/envs/scienv/lib/python3.6/site-packages/torch/nn/functional.py", line 1670, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 196.00 MiB (GPU 0; 5.77 GiB total capacity; 3.79 GiB already allocated; 185.38 MiB free; 4.66 GiB reserved in total by PyTorch)

That happens after the first epoch.
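
For reference, the figures in the error message above can be broken down with a few lines of Python. This is just a sketch for reading the numbers; `parse_oom_message` is a hypothetical helper, and the regular expressions assume the message has exactly the wording shown in the traceback.

```python
import re

# The error message copied from the traceback above.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 196.00 MiB (GPU 0; "
    "5.77 GiB total capacity; 3.79 GiB already allocated; "
    "185.38 MiB free; 4.66 GiB reserved in total by PyTorch)"
)

_UNIT_TO_MIB = {"KiB": 1.0 / 1024.0, "MiB": 1.0, "GiB": 1024.0}
_SIZE = r"([\d.]+) (KiB|MiB|GiB)"


def parse_oom_message(message):
    """Extract the sizes named in the message, converted to MiB."""
    patterns = {
        "requested": r"Tried to allocate " + _SIZE,
        "total": _SIZE + r" total capacity",
        "allocated": _SIZE + r" already allocated",
        "free": _SIZE + r" free",
        "reserved": _SIZE + r" reserved",
    }
    sizes = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError("could not find '{}' in message".format(name))
        sizes[name] = float(match.group(1)) * _UNIT_TO_MIB[match.group(2)]
    return sizes


sizes = parse_oom_message(OOM_MESSAGE)

# Memory held by PyTorch's caching allocator but not occupied by live tensors.
cached_unused = sizes["reserved"] - sizes["allocated"]
# Device memory not managed by PyTorch's allocator at all.
outside_pytorch = sizes["total"] - sizes["reserved"]

print("requested:        {:8.2f} MiB".format(sizes["requested"]))
print("free on device:   {:8.2f} MiB".format(sizes["free"]))
print("cached, unused:   {:8.2f} MiB".format(cached_unused))
print("outside PyTorch:  {:8.2f} MiB".format(outside_pytorch))
```

The requested 196.00 MiB is larger than the 185.38 MiB still free on the device, which is why the allocation fails, while live tensors already take up 3.79 GiB of the 5.77 GiB card.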