## 🐛 Bug
A possible CPU-side memory leak even when fitting on the GPU using PyTorch 0.4.1.
## To Reproduce
I am quite new to PyTorch, having used TF/Keras extensively in the past, but I am now trying to use PyTorch as a replacement. I decided to start small with a seq2seq Skip-Thought model, cobbled together from the PyTorch NLP tutorials. Everything works fine in small-scale tests. However, when I use the code to run a large-scale fit on 3000 separate paragraphs (each paragraph having a variable number of sentences), system RAM usage climbs steadily as the script runs. The Linux box has 64 GB of RAM; usage starts at 4.7% when the script launches and grows until it hits 100%, at which point the box becomes unresponsive and has to be force-rebooted.
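To quantify the growth, this is the kind of check that can be dropped into the training loop (a stdlib-only sketch; note `ru_maxrss` is reported in kilobytes on Linux, but in bytes on macOS):

```python
import resource

def peak_rss_kb():
    """Return this process's peak resident set size in kB (Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

# Example: record a baseline, run some iterations, then compare
baseline = peak_rss_kb()
# ... run a few training iterations here ...
print('peak RSS grew by %d kB' % (peak_rss_kb() - baseline))
```

Logging this every few hundred paragraphs makes the steady climb visible long before the box becomes unresponsive.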
Since I'm new to PyTorch, I'm not sure whether I'm doing something blatantly wrong in my code that would account for this behaviour?
This is my core fitting logic, which runs as a script on Linux:
```python
device = 'cuda'
model = SkipThought(len(text_dictionary.token2id), 128, 256, nn.NLLLoss()).to(device)

# corpus_orig is a list of lists where each list-element is a text paragraph with multiple sentences
n_iters = len(corpus_orig)
start = time.time()
print_loss_total = 0
optimizer = optim.SGD(model.parameters(), lr=0.01)

for xi in range(1, n_iters + 1):
    x = corpus_orig[xi - 1]
    sents = [s for s in map(str.strip, x.split('. ')) if len(s) > 0]
    for i in range(1, len(sents) - 1):
        input_tensor, prev_tensor, next_tensor = tensorsFromPair((sents[i], sents[i - 1], sents[i + 1]))
        optimizer.zero_grad()
        loss, prev_output, next_output = model(input_tensor, prev_tensor, next_tensor, use_teacher_forcing=True)
        loss.backward(retain_graph=True)  # PyTorch says I need retain_graph=True, otherwise I get an error here
        optimizer.step()
        print_loss_total += loss.item()
    print_loss_avg = print_loss_total
    print_loss_total = 0
    print('%s (%d %d%%) %.4f' % (timeSince(start, xi / n_iters),
                                 xi, xi / n_iters * 100, print_loss_avg), flush=True)
```
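I suspect the fact that I need `retain_graph=True` is itself a symptom: if any tensor fed into an iteration still carries autograd history from earlier iterations, every backward pass has to keep all the previous graphs alive, and the graph (and its buffers) grows without bound. A minimal standalone sketch of that pattern (not my model, just an illustration) where the graph hanging off `h` gets one level deeper every step:

```python
import torch

w = torch.ones(1, requires_grad=True)
h = torch.zeros(1)
graph_depths = []
for step in range(3):
    h = h + w  # h keeps a reference to the previous iteration's graph
    loss = h.sum()
    # without retain_graph=True the second backward() errors, because the
    # shared portion of the graph was freed by the first backward()
    loss.backward(retain_graph=True)
    # measure how deep the chain of grad_fn nodes behind h is
    depth, fn = 0, h.grad_fn
    while fn is not None:
        depth += 1
        nexts = [f for f, _ in fn.next_functions
                 if f is not None and type(f).__name__ != 'AccumulateGrad']
        fn = nexts[0] if nexts else None
    graph_depths.append(depth)
# graph_depths grows by one every iteration
```

Calling `h = h.detach()` at the end of each iteration breaks the chain and removes the need for `retain_graph=True` in this toy case.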
Here are the helper functions which create the tensors passed to the model by converting all text words into indices from a predefined gensim dictionary:
```python
def indexesFromSentence(sentence):
    return text_dictionary.doc2idx(sentence.split())

def tensorFromSentence(sentence):
    indexes = indexesFromSentence(sentence)
    indexes.append(EOS_token)
    return torch.tensor(indexes, dtype=torch.long, device=device).view(-1, 1)

def tensorsFromPair(pair):
    input_tensor = tensorFromSentence(pair[0])
    prev_tensor = tensorFromSentence(pair[1])
    next_tensor = tensorFromSentence(pair[2])
    return (input_tensor, prev_tensor, next_tensor)
```
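For context, here is a tiny self-contained sketch of what these helpers produce, with a stand-in `token2id` dict in place of the gensim dictionary and `EOS_token` assumed to be 1 (both are assumptions for illustration, not my real values):

```python
import torch

EOS_token = 1  # assumption: stand-in for the real EOS index
token2id = {"the": 2, "cat": 3, "sat": 4}  # stand-in for the gensim dictionary

def tensorFromSentence(sentence, device="cpu"):
    indexes = [token2id[w] for w in sentence.split()]
    indexes.append(EOS_token)
    # shape (seq_len, 1): one token index per row, as the model expects
    return torch.tensor(indexes, dtype=torch.long, device=device).view(-1, 1)

t = tensorFromSentence("the cat sat")  # shape (4, 1), last row is EOS
```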
And, finally, here is the SkipThought model itself, with all its helper classes (apologies for the wall of code):
```python
class EncoderRNN(nn.Module):
    def __init__(self, vocab_size, embedding_size, gru_size):
        super(EncoderRNN, self).__init__()
        self.vocab_size = vocab_size
        self.embedding_size = embedding_size
        self.gru_size = gru_size
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        self.gru = nn.GRU(embedding_size, gru_size)

    def forward(self, sentence, hidden):
        embeddings = self.embedding(sentence)
        embeddings = F.tanh(embeddings)
        output, hidden = self.gru(embeddings, hidden)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, 1, self.gru_size, device=device)

    def getEmbedding(self, sentence):
        return F.tanh(self.embedding(sentence))


class LocalAttention(nn.Module):
    def __init__(self, dim):
        super(LocalAttention, self).__init__()
        self.W = nn.Linear(dim, dim, bias=False)

    def score(self, decoder_hidden, encoder_out):
        encoder_out = self.W(encoder_out)
        encoder_out = encoder_out.permute(1, 0, 2)
        return encoder_out @ decoder_hidden.permute(1, 2, 0)

    def forward(self, decoder_hidden, encoder_out):
        energies = self.score(decoder_hidden, encoder_out)
        mask = F.softmax(energies, dim=1)
        context = encoder_out.permute(1, 2, 0) @ mask
        context = context.permute(2, 0, 1)
        mask = mask.permute(2, 0, 1)
        return context, mask


class SkipThought(nn.Module):
    def __init__(self, vocab_size, embedding_size, gru_size, criterion):
        super(SkipThought, self).__init__()
        self.vocab_size = vocab_size
        self.embedding_size = embedding_size
        self.gru_size = gru_size
        self.criterion = criterion
        self.encoder = EncoderRNN(vocab_size, embedding_size, gru_size)
        self.prev_gru = nn.GRU(embedding_size + gru_size, gru_size)
        self.next_gru = nn.GRU(embedding_size + gru_size, gru_size)
        self.attention = LocalAttention(gru_size)
        self.worder = nn.Linear(gru_size * 2, vocab_size)
        self.softmax = nn.LogSoftmax(dim=1)
        self.encoder_hidden = self.encoder.initHidden()

    def forward(self, input_tensor, prev_tensor, next_tensor, use_teacher_forcing=True):
        encoder_hidden = self.encoder_hidden
        prev_length = prev_tensor.size(0)
        next_length = next_tensor.size(0)
        loss = 0
        encoder_output, encoder_hidden = self.encoder(input_tensor, encoder_hidden)
        prev_input = torch.tensor([[SOS_token]], device=device)
        next_input = torch.tensor([[SOS_token]], device=device)
        prev_hidden = encoder_hidden
        next_hidden = encoder_hidden
        self.encoder_hidden = encoder_hidden
        prev_output = []
        next_output = []
        if use_teacher_forcing:
            # Teacher forcing: Feed the target as the next input
            for di in range(prev_length):
                embedded = self.encoder.getEmbedding(prev_input)
                context, _ = self.attention(prev_hidden, encoder_output)
                decoder_output, prev_hidden = self.prev_gru(torch.cat([embedded, context], dim=2), prev_hidden)
                decoder_output = self.softmax(self.worder(torch.cat([decoder_output, context], dim=2)[0]))
                loss += self.criterion(decoder_output, prev_tensor[di])
                prev_output.append(decoder_output.topk(1)[1].squeeze().detach().item())
                prev_input = prev_tensor[di].unsqueeze(0)  # Teacher forcing
            for di in range(next_length):
                embedded = self.encoder.getEmbedding(next_input)
                context, _ = self.attention(next_hidden, encoder_output)
                decoder_output, next_hidden = self.next_gru(torch.cat([embedded, context], dim=2), next_hidden)
                decoder_output = self.softmax(self.worder(torch.cat([decoder_output, context], dim=2)[0]))
                loss += self.criterion(decoder_output, next_tensor[di])
                next_output.append(decoder_output.topk(1)[1].squeeze().detach().item())
                next_input = next_tensor[di].unsqueeze(0)  # Teacher forcing
        else:
            # Without teacher forcing: use its own predictions as the next input
            for di in range(prev_length):
                embedded = self.encoder.getEmbedding(prev_input)
                context, _ = self.attention(prev_hidden, encoder_output)
                decoder_output, prev_hidden = self.prev_gru(torch.cat([embedded, context], dim=2), prev_hidden)
                decoder_output = self.softmax(self.worder(torch.cat([decoder_output, context], dim=2)[0]))
                topv, topi = decoder_output.topk(1)
                prev_input = topi.squeeze().detach()  # detach from history as input
                loss += self.criterion(decoder_output, prev_tensor[di])
                prev_output.append(prev_input.item())
                if prev_input.item() == EOS_token:
                    break
                prev_input = prev_input.unsqueeze(0).unsqueeze(0)
            for di in range(next_length):
                embedded = self.encoder.getEmbedding(next_input)
                context, _ = self.attention(next_hidden, encoder_output)
                decoder_output, next_hidden = self.next_gru(torch.cat([embedded, context], dim=2), next_hidden)
                decoder_output = self.softmax(self.worder(torch.cat([decoder_output, context], dim=2)[0]))
                topv, topi = decoder_output.topk(1)
                next_input = topi.squeeze().detach()  # detach from history as input
                loss += self.criterion(decoder_output, next_tensor[di])
                next_output.append(next_input.item())
                if next_input.item() == EOS_token:
                    break
                next_input = next_input.unsqueeze(0).unsqueeze(0)
        return loss, prev_output, next_output
```
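One thing I noticed while writing this up: `forward` stores `encoder_hidden` back onto `self.encoder_hidden`, so the hidden state passed into the next call still carries the previous call's autograd graph, which is presumably why `retain_graph=True` is required. For comparison, a minimal sketch (hypothetical, not my model) of the detach-between-sequences pattern I've seen in RNN examples, where no `retain_graph` is needed:

```python
import torch
import torch.nn as nn

gru = nn.GRU(4, 8)          # input size 4, hidden size 8 (arbitrary)
hidden = torch.zeros(1, 1, 8)

for step in range(3):
    out, hidden = gru(torch.randn(5, 1, 4), hidden)
    loss = out.sum()
    loss.backward()          # no retain_graph needed...
    hidden = hidden.detach() # ...because the carried state is cut from the old graph
```

Whether that pattern is appropriate for my model (or whether carrying the state at all is correct here) is exactly what I'm unsure about.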
## Expected behavior
Host (CPU-side) memory usage should not grow steadily as the fitting script runs, and certainly not to the point at which the box dies.
## Environment
- PyTorch Version (e.g., 1.0): 0.4.1
- OS (e.g., Linux): Linux (Ubuntu 16.04)
- How you installed PyTorch (`conda`, `pip`, source): pip install torch
- Build command you used (if compiling from source): N/A
- Python version: 3.6.6
- CUDA/cuDNN version: CUDA 9.0.176 / CuDNN 7.4.1.5
- GPU models and configuration: Tesla V100
- Any other relevant information: