I’m using a third-party project to train an age-gender prediction model.
This is the script that has the problem.
The problem is that with every epoch the memory occupied by the script increases.
I observe the memory usage via nvidia-smi.
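In case it helps with diagnosing this, here is a minimal sketch of how the usage could also be logged from inside the script rather than only watched in nvidia-smi. The helper names `bytes_to_mib` and `log_cuda_memory` are hypothetical and not part of the project; `torch.cuda.memory_allocated` and `torch.cuda.memory_reserved` are assumed to be available, which I believe is the case from PyTorch 1.4 on.

```python
def bytes_to_mib(num_bytes):
    """Convert a byte count to MiB."""
    return num_bytes / (1024 ** 2)


def log_cuda_memory(tag, device=0):
    """Print memory held by live tensors and by PyTorch's caching allocator."""
    # Imported lazily so bytes_to_mib can be used without PyTorch installed.
    import torch

    if not torch.cuda.is_available():
        print(f"[{tag}] CUDA not available")
        return
    # Memory currently occupied by tensors on the device.
    allocated = bytes_to_mib(torch.cuda.memory_allocated(device))
    # Memory the caching allocator holds, including blocks not in use.
    reserved = bytes_to_mib(torch.cuda.memory_reserved(device))
    print(f"[{tag}] allocated: {allocated:.2f} MiB, reserved: {reserved:.2f} MiB")
```

Calling something like `log_cuda_memory("epoch 1")` at the end of each epoch should show whether the allocated figure itself keeps growing, or only the reserved one that nvidia-smi reflects.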
I can decrease my batch size, and then the training can last 2 epochs, but after that CUDA runs out of memory again.
I suppose memory is being allocated in the wrong place, or something like that.
Maybe someone can spot such a detail at a quick look.
Any help would be appreciated.
This is my error traceback:
  File "train.py", line 641, in <module>
    a.train_model()
  File "train.py", line 294, in train_model
    gender_out, age_out = self.model(inputs)
  File "/home/algernone/anaconda3/envs/scienv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/algernone/DNNS/age-gender-pytorch/Age-Gender-Pred/agegenpredmodel.py", line 140, in forward
    last1 = self.get_resnet_convs_out(x)
  File "/home/algernone/DNNS/age-gender-pytorch/Age-Gender-Pred/agegenpredmodel.py", line 83, in get_resnet_convs_out
    x = self.resNet.layer1(x) # out = [N, 64, 56, 56]
  File "/home/algernone/anaconda3/envs/scienv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/algernone/anaconda3/envs/scienv/lib/python3.6/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/algernone/anaconda3/envs/scienv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/algernone/anaconda3/envs/scienv/lib/python3.6/site-packages/torchvision/models/resnet.py", line 65, in forward
    out = self.bn2(out)
  File "/home/algernone/anaconda3/envs/scienv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/algernone/anaconda3/envs/scienv/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 107, in forward
    exponential_average_factor, self.eps)
  File "/home/algernone/anaconda3/envs/scienv/lib/python3.6/site-packages/torch/nn/functional.py", line 1670, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 196.00 MiB (GPU 0; 5.77 GiB total capacity; 3.79 GiB already allocated; 185.38 MiB free; 4.66 GiB reserved in total by PyTorch)
This happens after the first epoch.
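To make the numbers in the error message easier to compare, here is a small sketch that pulls them out and converts everything to MiB. The function name `parse_oom_message` is hypothetical, and the regular expression assumes the message wording shown above, which may differ in other PyTorch versions.

```python
import re

# Conversion factors to MiB for the two units that appear in the message.
UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def parse_oom_message(message):
    """Return the labelled sizes from a CUDA OOM message, in MiB."""
    pattern = (r"(\d+(?:\.\d+)?) (MiB|GiB) "
               r"(total capacity|already allocated|free|reserved)")
    sizes = {}
    for value, unit, label in re.findall(pattern, message):
        sizes[label] = float(value) * UNIT_TO_MIB[unit]
    return sizes


message = (
    "CUDA out of memory. Tried to allocate 196.00 MiB (GPU 0; "
    "5.77 GiB total capacity; 3.79 GiB already allocated; "
    "185.38 MiB free; 4.66 GiB reserved in total by PyTorch)"
)
sizes = parse_oom_message(message)

# Memory PyTorch has reserved but that is not occupied by tensors.
reserved_unused = sizes["reserved"] - sizes["already allocated"]
print(f"reserved but not allocated: {reserved_unused:.2f} MiB")
print(f"free on the device: {sizes['free']:.2f} MiB")
```

As far as I understand it, the difference between reserved and allocated (4.66 GiB - 3.79 GiB = 0.87 GiB) is memory held by PyTorch's caching allocator without being used by tensors, while the failed request of 196.00 MiB was larger than the 185.38 MiB still free on the device.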