Model using too much memory when initialising

I am trying to make a headline generator, but when I initialise my model it keeps crashing. I checked my memory usage and saw that it was using 50GB of memory.

import torch
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size

        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.o2o = nn.Linear(hidden_size + output_size, output_size)
        self.dropout = nn.Dropout(0.1)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        input_combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(input_combined)
        output = self.i2o(input_combined)
        output_combined = torch.cat((hidden, output), 1)
        output = self.o2o(output_combined)
        output = self.dropout(output)
        output = self.softmax(output)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)

Here is how I initialise it

rnn = RNN(len(vocab.get_itos()), 128, len(vocab.get_itos()))

The length of the vocab is 102,577. I don’t know if that’s too much or if my computer is just bad, but if it is too big, what should I do to reduce it?

What exceptions or errors do you get? What are your system specs?

What happens if you try just for testing:

rnn = RNN(100, 128, 100)

I don’t know how to check in a Jupyter notebook, but I am on a 2017 Intel MacBook Air with 8GB of RAM.

It runs instantly with no errors. As I said, I think it might be the large vocab. How can I decrease it?

With a vocab size of 102,577, i2h has (102,577+128)*128 ≈ 13.1 million parameters, which is harmless. But i2o and o2o both produce outputs the size of the full vocabulary, so each has (102,577+128)*102,577 ≈ 10.5 billion parameters. At 32-bit precision that is roughly 42 GB per layer, about 84 GB in total, which is far more than 8GB of RAM and would explain the crash. (If all three layers had 128-dimensional outputs, the whole network would only take about 150 MB.)
What is your batch size?
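For reference, the full per-layer arithmetic can be sketched in plain Python (a `Linear(in_f, out_f)` has `in_f*out_f` weights plus `out_f` biases; the shapes come from the `nn.Linear` definitions in the class above):

```python
V = 102_577   # vocab size (input_size == output_size in the post above)
H = 128       # hidden size

def linear_params(in_f, out_f):
    # weights + biases of a Linear(in_f, out_f) layer
    return in_f * out_f + out_f

i2h = linear_params(V + H, H)   # ~13 million
i2o = linear_params(V + H, V)   # ~10.5 billion
o2o = linear_params(H + V, V)   # ~10.5 billion

total = i2h + i2o + o2o
gigabytes = total * 4 / 1e9     # float32 = 4 bytes per parameter

print(f"{total:,} parameters ~= {gigabytes:.1f} GB")
```

The hidden-to-hidden path is cheap; almost all of the memory sits in the two layers whose output dimension equals the vocabulary size.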

You could train only on lower-cased text. That should bring down your vocabulary a bit, but I would recommend using a proper tokenizer.

There are many common preprocessing techniques, e.g.:

  • case-folding
  • stemming/lemmatization
  • normalization (e.g., removal of numbers, emoji, emoticons, punctuation marks)
  • removal of rare words (optional: replacing with a special “unknown” token)
  • subword tokenization

Which steps are appropriate depends on your task. Since you are trying to generate text, stemming/lemmatization is probably out. Maybe this notebook gives you some ideas.
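For the case-folding and rare-word steps in particular, a minimal sketch in plain Python (the corpus, the threshold, and the `<unk>` token name are all illustrative):

```python
from collections import Counter

corpus = [
    "Stocks rally as markets rebound",
    "Markets rally after stocks rebound sharply",
    "Local team wins championship",
]

# Case-fold and split into tokens (a real pipeline would use a proper tokenizer).
tokenized = [line.lower().split() for line in corpus]

# Keep only tokens that appear at least `min_freq` times; map the rest to <unk>.
min_freq = 2
counts = Counter(tok for line in tokenized for tok in line)
vocab = {tok for tok, c in counts.items() if c >= min_freq}

encoded = [[tok if tok in vocab else "<unk>" for tok in line] for line in tokenized]
print(encoded[0])
```

Raising `min_freq` shrinks the vocabulary aggressively, at the cost of mapping more of the text to the unknown token.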

For me the kernel only crashes with the code above

I uploaded the code to kaggle

You can try to convert your notebook to a .py script and run it. Jupyter might have settings to restrict the memory a notebook/kernel is allowed to use:

jupyter nbconvert --to script mynotebook.ipynb

And then run

python mynotebook.py
I was also wondering what you’re trying to do. From your notebook it looks like you want to predict the headline given a short description of a news article. Is this correct? If so, I’m wondering about this snippet:

    for i in range(input_line_tensor.size(0)):
        output, hidden = rnn(input_line_tensor[i], hidden)
        l = criterion(output, target_line_tensor[i])
        loss += l

As this assumes that input_line_tensor and target_line_tensor have the same length.

Your network is suitable for a sequence-labeling task. However, if you want to generate a headline from a news article as input, then this is a sequence-to-sequence task, and you need an encoder-decoder architecture.
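To make the distinction concrete, here is a minimal encoder-decoder sketch with GRUs (all sizes are illustrative; a real model would add attention, teacher forcing, padding/masking, etc.):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)
        self.gru = nn.GRU(emb_size, hidden_size, batch_first=True)

    def forward(self, src):                  # src: (batch, src_len) token ids
        _, h = self.gru(self.embed(src))     # h: (1, batch, hidden)
        return h

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_size, hidden_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_size)
        self.gru = nn.GRU(emb_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, tgt, h):               # tgt: (batch, tgt_len) token ids
        o, h = self.gru(self.embed(tgt), h)
        return self.out(o), h                # logits: (batch, tgt_len, vocab)

# The input (description) and output (headline) can now have different lengths.
enc, dec = Encoder(5000, 64, 128), Decoder(5000, 64, 128)
src = torch.randint(0, 5000, (2, 30))        # 2 descriptions, 30 tokens each
tgt = torch.randint(0, 5000, (2, 8))         # 2 headlines, 8 tokens each
logits, _ = dec(tgt, enc(src))
print(logits.shape)                          # torch.Size([2, 8, 5000])
```

Note the `nn.Embedding` layers: with a 100k+ vocabulary, a small learned embedding is far cheaper in memory than feeding one-hot vectors of size `vocab_size` into `nn.Linear`, as the original class effectively does.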

I’m pretty new to machine learning and am adapting this tutorial with my own code. The aim is to generate a headline given the description of the article. The code you quoted was copied from the tutorial; I had not edited it yet.

I ran the notebook as a script, and after reaching 55GB of RAM it gave me the error

Killed: 9

I feel like it might be a memory leak, because I didn’t think it should be using that much RAM.

This is a very different task from the one being done here. Look at what the input and target of a single training sample look like:


The input is a name, and the target is the same name shifted one letter to the left. In short, the goal is to train a language model that will then generate a name given a starting sequence of letters.
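A tiny sketch of how such a training pair is built (the name and the EOS marker are illustrative):

```python
name = "Kasparov"
EOS = "<EOS>"

# The target is the input shifted one step left, with an end marker appended:
input_seq  = list(name)                 # ['K','a','s','p','a','r','o','v']
target_seq = list(name[1:]) + [EOS]     # ['a','s','p','a','r','o','v','<EOS>']

# At step i the network sees input_seq[i] and is trained to predict target_seq[i],
# which is why input and target tensors always have the same length here.
for x, y in zip(input_seq, target_seq):
    print(f"{x!r} -> {y!r}")
```

This equal-length structure is exactly what the training loop quoted earlier assumes, and it does not hold for a description-to-headline pair.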

What you want is a Seq2Seq model similar to Machine Translation, where the input is a text (e.g., a short summary of a news article) and the output is a new text (e.g., a headline).

Oh, I thought I could change it from letters to words and generate words that way. Anyway, I’ll try the Seq2Seq model.