Loss.backward() cuda runtime error (2) : out of memory at /pytorch/torch/lib/THC/generic/THCStorage.cu:58

I am running a CNN-HRNN model. The CNN is a pre-trained VGG16 encoder, and the HRNN is a Hierarchical RNN language model for image description generation. I load the encoder on multiple GPUs like this: model.Encoder = torch.nn.DataParallel(model.Encoder, device_ids=[0,1,2,7]).

When I use a DataLoader on a small part of my dataset (about 4,500 samples), training runs fine with a batch size of 160. But when I load more of the dataset (more than 12,000 samples) with the same batch size of 160, it runs out of GPU memory after several iterations. Why?
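
One factor that may be worth checking, stated as an assumption because the padding/collate code is not shown here: if each batch is padded to its longest paragraph and longest sentence, then the tensor size depends on the longest sample in the batch, not only on the batch size. A larger dataset is more likely to contain a few very long samples. The sketch below uses plain Python with made-up sentence lengths; `padded_elements` and `worst_batch` are hypothetical helpers, not part of train.v1.py.

```python
def padded_elements(batch):
    """Token slots after padding a batch to (batch, max_sents, max_words).

    `batch` is a list of paragraphs; each paragraph is a list of sentence
    lengths in words. This padding scheme is an assumption, not taken
    from the training script.
    """
    max_sents = max(len(p) for p in batch)
    max_words = max(max(p) for p in batch)
    return len(batch) * max_sents * max_words


def worst_batch(dataset, batch_size):
    """Largest padded batch when the dataset is cut into consecutive batches."""
    batches = [dataset[i:i + batch_size]
               for i in range(0, len(dataset), batch_size)]
    return max(padded_elements(b) for b in batches)


# Made-up data: the larger set adds one unusually long paragraph.
small = [[5, 6], [4, 6], [6, 3], [5, 5]]
large = small + [[5, 6], [7, 30, 8, 9]]

print(worst_batch(small, 2))  # 24
print(worst_batch(large, 2))  # 240
```

With the same batch size, the single long paragraph makes the worst batch ten times larger in this toy case. If something similar happens in the real data, it would be consistent with the error appearing only after several iterations, once such a batch is drawn.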

I am getting this error:

THCudaCheck FAIL file=/pytorch/torch/lib/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
  File "train.v1.py", line 472, in <module>
  File "train.v1.py", line 213, in main
    train(train_loader, model, criterion, optimizer, epoch)
  File "train.v1.py", line 328, in train
  File "/home/bbbian/local/anaconda/lib/python2.7/site-packages/torch/autograd/variable.py", line 167, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
  File "/home/bbbian/local/anaconda/lib/python2.7/site-packages/torch/autograd/__init__.py", line 99, in backward
    variables, grad_variables, retain_graph)
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/torch/lib/THC/generic/THCStorage.cu:58

Here is part of my training code:

for i, data in enumerate(train_loader):
     img=Variable ...
     word_outputs, sent_outputs = model(img, inputs)
     wordRNN_loss = criterion[0](word_outputs, inputs[:, :, 1:], inputs_mask[:, :, 1:])
     sentRNN_loss = criterion[1](sent_outputs, inputs_num)
     wordRNN_losses.update(wordRNN_loss.data[0], data[0].size(0))
     sentRNN_losses.update(sentRNN_loss.data[0], data[0].size(0))
     # combined loss
     loss = sentRNN_loss * opts.lambda_sent + wordRNN_loss * opts.lambda_word