Why does GPU memory usage keep growing ceaselessly when training the model?

Hello everyone. Recently, I implemented a simple recursive neural network. When training this model on a small sample dataset, everything works fine. However, when training it on large data on GPUs, an "out of memory" error is raised. As training goes on, GPU memory usage keeps growing. So I want to know why this happens. I would be grateful if you could help.

The model and training procedure are defined as follows:

    def train_step(self, data):
        train_loss = 0
        for _data in data:
            p_tree = _data['p_tree']
            h_tree = _data['h_tree']
            if args.cuda:
                target = Variable(torch.LongTensor([_data['label']]).cuda())
            else:
                target = Variable(torch.LongTensor([_data['label']]))
            self.optimizer.zero_grad()
            # self.model is an instance of class RootAlign
            output = self.model(p_tree, h_tree)
            loss = F.nll_loss(output, target)
            loss.backward()
            self.optimizer.step()
            train_loss += loss.data[0]
        return train_loss

class RootAlign(nn.Module):
    def __init__(self, word_embedding, config):
        super(RootAlign, self).__init__()
        self.rnn = VanillaRecursiveNN(word_embedding, config['hidden_dim'], config['cuda_flag'])
        self.linear = nn.Linear(config['hidden_dim'] * 2, config['relation_num'])

    def forward(self, p_tree, h_tree):
        p_tree.postorder_traverse(self.rnn)
        h_tree.postorder_traverse(self.rnn)

        out = F.log_softmax(self.linear(F.sigmoid(torch.cat((p_tree.calculate_result, h_tree.calculate_result), 1))))
        return out

class VanillaRecursiveNN(nn.Module):
    def __init__(self, word_embedding, hidden_dim, cuda_flag=False):
        super(VanillaRecursiveNN, self).__init__()
        self.word_dim = word_embedding.embeddings.size(1)
        self.hidden_dim = hidden_dim

        self.embedding = nn.Embedding(word_embedding.embeddings.size(0),
                                      self.word_dim)
        self.embedding.weight = nn.Parameter(word_embedding.embeddings)

        self.word2hidden = nn.Linear(self.word_dim, self.hidden_dim, False)
        self.hidden2hidden = nn.Linear(2 * self.hidden_dim, self.hidden_dim)

        self.cuda_flag = cuda_flag

    def forward(self, node):
        if not node.val is None:
            if self.cuda_flag:
                node.calculate_result = self.word2hidden(
                    self.embedding(Variable(torch.LongTensor([node.word_id]).cuda())))
            else:
                node.calculate_result = self.word2hidden(
                    self.embedding(Variable(torch.LongTensor([node.word_id]))))
            return node.calculate_result
        else:
            assert len(node.children) == 2
            node.calculate_result = self.hidden2hidden(torch.cat((node.children[0].calculate_result,
                                                          node.children[1].calculate_result), 1))
            return node.calculate_result

Do you need to save node.calculate_result? Are you using these values later on? If not, I'd discourage saving them.

If you need them, but don't want to backprop through them (which seems to be the case), you should save only the tensor, not the Variable that wraps it. This will allow the graph that holds the buffers needed for backward to be freed, releasing the held memory. Just replace node.calculate_result = ... with result = ...; node.calculate_result = result.data.
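
For example, the leaf branch of VanillaRecursiveNN.forward could be changed roughly like this (a minimal sketch of the suggested replacement, keeping the surrounding code as in the original forward):

variable = Variable(torch.LongTensor([node.word_id]))
if self.cuda_flag:
    variable = variable.cuda()
result = self.word2hidden(self.embedding(variable))  # result is still a Variable
node.calculate_result = result.data                  # cache only the tensor, not the Variable
return result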

I observed similar GPU memory behavior.

When I test two different implementations of the same function, shown below, where A is a cuda.Tensor:

import torch

def function1(A):
    B = A**2 - 2*A
    C = torch.sqrt(B)
    return C

def function2(A):
    return torch.sqrt(A**2 - 2*A)

Both functions compute the same result. However, function1 seems to allocate GPU memory for the local variables B and C, while function2 seems to allocate only the memory needed to compute torch.sqrt(A**2 - 2*A), which is presumably the same size as A.

Thus, in terms of memory usage, function2 seems to be twice as efficient as function1.

This doesn't apply to all cases, but in many cases removing intermediate variables reduces GPU memory usage a lot in my programs.

This seems to be because the underlying CUDA allocator does not free memory immediately once it is no longer needed.

I think PyTorch needs some GPU memory garbage collection mechanism for efficient GPU memory management.

Both functions will consume the same amount of memory. The execution will look like this (in parentheses you have the current/peak memory usage in multiples of A's size):

  • Assume A is allocated (1/1)
  • Compute A**2 (2/2)
  • Compute 2*A (3/3)
  • Compute A**2 - 2*A (4/4)
  • Free A**2 and 2*A (2/4)
  • Compute torch.sqrt(B) (3/4)
  • Return and free everything except the input and result (2/4)

Nevertheless, it is good advice to try to minimize the number of local variables: the sooner things go out of scope, the sooner the memory becomes available for reuse (by the way, you can use del to free locals you don't need).
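
For instance, function1 above could release its intermediate explicitly (a minimal illustration of the del advice; the peak from the subtraction step is unchanged):

import torch

def function1_with_del(A):
    B = A**2 - 2*A
    C = torch.sqrt(B)
    del B           # drop the last reference so B's memory can be reused immediately
    return C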

There's no way for the framework to know when a tensor won't be needed anymore; we don't have that knowledge upfront, and this is why it's impossible to implement any garbage collection. The memory management is already very efficient, and all tensors are freed as soon as you let them go.


Thanks!

Things became clear.

Thanks for the help. I did the replacement as you suggested, in the following ways:

# way No.1
if not node.val is None:
    if self.cuda_flag:
        variable = Variable(torch.LongTensor([node.word_id]).cuda())
    else:
        variable = Variable(torch.LongTensor([node.word_id]))
    result = self.word2hidden(self.embedding(variable))
    node.calculate_result = result.data
    return node.calculate_result

# way No.2
if not node.val is None:
    if self.cuda_flag:
        node.calculate_result = self.word2hidden(self.embedding(
            Variable(torch.LongTensor([node.word_id]).cuda()))).data
    else:
        node.calculate_result = self.word2hidden(self.embedding(
            Variable(torch.LongTensor([node.word_id])))).data
    return node.calculate_result

# way No.3
if not node.val is None:
    if self.cuda_flag:
        result = self.word2hidden(
            self.embedding(Variable(torch.LongTensor([node.word_id]).cuda())))
    else:
        result = self.word2hidden(
            self.embedding(Variable(torch.LongTensor([node.word_id]))))
    node.calculate_result = result.data
    return node.calculate_result

However, they all raised the same TypeError; the traceback is:

Traceback (most recent call last):
  File "/home/shawnguo/PythonWS/KnowledgeEnhancedTE/trainer.py", line 172, in <module>
    t.train()
  File "/home/shawnguo/PythonWS/KnowledgeEnhancedTE/trainer.py", line 111, in train
    train_loss = self.train_step(self.data.train)
  File "/home/shawnguo/PythonWS/KnowledgeEnhancedTE/trainer.py", line 143, in train_step
    output = self.model(p_tree, h_tree)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 210, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/shawnguo/PythonWS/KnowledgeEnhancedTE/align_model.py", line 18, in forward
    p_tree.postorder_traverse(self.rnn)
  File "/home/shawnguo/PythonWS/KnowledgeEnhancedTE/tree.py", line 186, in postorder_traverse
    c.postorder_traverse(func)
  File "/home/shawnguo/PythonWS/KnowledgeEnhancedTE/tree.py", line 186, in postorder_traverse
    c.postorder_traverse(func)
  File "/home/shawnguo/PythonWS/KnowledgeEnhancedTE/tree.py", line 186, in postorder_traverse
    c.postorder_traverse(func)
  File "/home/shawnguo/PythonWS/KnowledgeEnhancedTE/tree.py", line 187, in postorder_traverse
    func(self)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 219, in __call__
    var = var[0]
TypeError: 'float' object has no attribute '__getitem__'

It seems that torch.LongTensor([node.word_id]) should be replaced with torch.LongTensor([[node.word_id]]). However, after I made that change, a new RuntimeError was raised. The traceback is:

Traceback (most recent call last):
  File "/home/shawnguo/PythonWS/KnowledgeEnhancedTE/trainer.py", line 172, in <module>
    t.train()
  File "/home/shawnguo/PythonWS/KnowledgeEnhancedTE/trainer.py", line 111, in train
    train_loss = self.train_step(self.data.train)
  File "/home/shawnguo/PythonWS/KnowledgeEnhancedTE/trainer.py", line 143, in train_step
    output = self.model(p_tree, h_tree)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 210, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/shawnguo/PythonWS/KnowledgeEnhancedTE/align_model.py", line 18, in forward
    p_tree.postorder_traverse(self.rnn)
  File "/home/shawnguo/PythonWS/KnowledgeEnhancedTE/tree.py", line 186, in postorder_traverse
    c.postorder_traverse(func)
  File "/home/shawnguo/PythonWS/KnowledgeEnhancedTE/tree.py", line 186, in postorder_traverse
    c.postorder_traverse(func)
  File "/home/shawnguo/PythonWS/KnowledgeEnhancedTE/tree.py", line 186, in postorder_traverse
    c.postorder_traverse(func)
  File "/home/shawnguo/PythonWS/KnowledgeEnhancedTE/tree.py", line 187, in postorder_traverse
    func(self)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 210, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/shawnguo/PythonWS/KnowledgeEnhancedTE/tree_models.py", line 26, in forward
    self.embedding(Variable(torch.LongTensor([[node.word_id]]).cuda())))
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 210, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/linear.py", line 52, in forward
    return self._backend.Linear()(input, self.weight)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/_functions/linear.py", line 10, in forward
    output.addmm_(0, 1, input, weight.t())
RuntimeError: matrix and matrix expected at /home/soumith/local/builder/wheel/pytorch-src/torch/lib/THC/generic/THCTensorMathBlas.cu:235

I don't know why the forward function raises these errors or how to fix them. Also, since I need to optimize self.word2hidden = nn.Linear(self.word_dim, self.hidden_dim, False) and self.hidden2hidden = nn.Linear(2 * self.hidden_dim, self.hidden_dim) in class VanillaRecursiveNN, it seems that node.calculate_result needs to be saved for backprop. So your solution may not address the problem I mentioned above, GPU memory usage growing ceaselessly. Should I manually free the GPU memory? If so, how?

Anyway, thanks again for your help. Looking forward to your reply.

I think it's just a wrong input size for the fc layer and incorrect usage of the cat() function.

import torch
hidden_dim = 10

x = torch.randn(hidden_dim, 1).cuda()
print(x.size())
y = torch.cat((x, x), 1)
print(y.size())
y = torch.cat((x, x), 0)
print(y.size())

The result is:

torch.Size([10, 1]) <-- hidden_dim x 1
torch.Size([10, 2])
torch.Size([20, 1]) <-- 2 * hidden_dim x 1

So the fixed code is:

self.hidden2hidden = nn.Linear(2 * self.hidden_dim, self.hidden_dim)
...
node.calculate_result = self.hidden2hidden(torch.cat((node.children[0].calculate_result, node.children[1].calculate_result), 0))

Thanks for your help. However, the problem occurs in the following code:

if not node.val is None:
    if self.cuda_flag:
        node.calculate_result = self.word2hidden(
            self.embedding(Variable(torch.LongTensor([node.word_id]).cuda())))
    else:
        node.calculate_result = self.word2hidden(
            self.embedding(Variable(torch.LongTensor([node.word_id]))))
    return node.calculate_result

Also, I've found a puzzling phenomenon: if the above code is changed to

if not node.val is None:
    if self.cuda_flag:
        variable = Variable(torch.LongTensor([node.word_id]).cuda())
    else:
        variable = Variable(torch.LongTensor([node.word_id]))
    node.calculate_result = self.word2hidden(self.embedding(variable))
    return node.calculate_result

then the TypeError is raised. Aren't these two implementations the same? If not, what's the difference?

You should check the connectivity between the network layers.

Try this:

self.embedding(Variable(torch.LongTensor(node.word_id).cuda()))

or

add a squeeze on the variable:

variable = variable.squeeze()

Thanks for your reply. The first suggestion is obviously impracticable: torch.LongTensor(node.word_id) would produce a tensor whose length is node.word_id (an integer). As for the second one, I don't know what .squeeze() does, but since the problem occurs in self.embedding(Variable(torch.LongTensor([node.word_id]).cuda())), it may not help solve the problem either.

Anyway, thanks for your help.

You don't need to save anything for backprop; autograd will take care of that, and my solution is valid. The problems you're having are only due to giving inputs of invalid sizes to different modules. You can print them inside your module and see whether they are what you expect and whether they match the requirements specified in the docs.
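
For example, a quick way to check the sizes outside the model (a hypothetical snippet with made-up dimensions, just to illustrate this kind of check):

import torch
import torch.nn as nn
from torch.autograd import Variable

word_dim, hidden_dim = 300, 150            # made-up example dimensions
embedding = nn.Embedding(100, word_dim)
word2hidden = nn.Linear(word_dim, hidden_dim, False)
hidden2hidden = nn.Linear(2 * hidden_dim, hidden_dim)

x = Variable(torch.LongTensor([7]))        # same shape as torch.LongTensor([node.word_id])
emb = embedding(x)
h = word2hidden(emb)
print(emb.size(), h.size())                # expect (1, word_dim) and (1, hidden_dim)
pair = torch.cat((h, h), 1)
print(pair.size())                         # expect (1, 2 * hidden_dim), matching hidden2hidden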

Yes, you're right. The computation in the case "not node.val is None" is correct. The problem is in the computation for the other case. I'm trying to fix it.

Thank you very much!

Now, it seems that the return in the following code is the problem:

if not node.val is None:
    if self.cuda_flag:
        variable = Variable(torch.LongTensor([node.word_id]).cuda())
    else:
        variable = Variable(torch.LongTensor([node.word_id]))
    result = self.word2hidden(self.embedding(variable))
    node.calculate_result = result.data
    return node.calculate_result

Now I think I know the problem.
In module.py, the base class Module has a method __call__; at lines 210-211 there is this loop:

while not isinstance(var, Variable):
        var = var[0]

Since I return a torch.FloatTensor, the loop keeps indexing into it until the error is raised.

After I made the following changes, everything works again, except that the model's weights are not updated.

Change No.1, in RootAlign:

class RootAlign(nn.Module):
    def __init__(self, word_embedding, config):
        super(RootAlign, self).__init__()
        self.rnn = VanillaRecursiveNN(word_embedding, config['hidden_dim'], config['cuda_flag'])
        self.linear = nn.Linear(config['hidden_dim'] * 2, config['relation_num'])

    def forward(self, p_tree, h_tree):
        p_tree.postorder_traverse(self.rnn)
        h_tree.postorder_traverse(self.rnn)

        p_result = Variable(p_tree.calculate_result)
        h_result = Variable(h_tree.calculate_result)
        out = F.log_softmax(self.linear(F.sigmoid(
            torch.cat((p_result, h_result), 1))))
        return out

Change No.2, in VanillaRecursiveNN:

def forward(self, node):
    if not node.val is None:
        if self.cuda_flag:
            result = self.word2hidden(self.embedding(
                Variable(torch.LongTensor([node.word_id]).cuda())))
        else:
            result = self.word2hidden(self.embedding(
                Variable(torch.LongTensor([node.word_id]))))
        node.calculate_result = result.data
        return result
    else:
        assert len(node.children) == 2
        l_result = Variable(node.children[0].calculate_result)
        r_result = Variable(node.children[1].calculate_result)
        result = self.hidden2hidden(torch.cat((l_result, r_result), 1))
        node.calculate_result = result.data
        return result

It still seems that GPU memory usage grows ceaselessly. :disappointed_relieved:

Ah, I now see that you're storing the Variables in the tree because you need to read them from the higher parts. In this case you can't unpack the .data, because you need the backprop to include the lower parts of the tree too.

The problem I see is that you're not clearing the Variables/tensors stored in the trees after you finish the iteration. You could try writing a simple function that traverses the tree and deletes all the stored outputs after computing output = self.model(p_tree, h_tree) (you need to clean both trees). This should reduce the memory usage.
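
Such a cleanup function could look roughly like this (a minimal sketch, assuming p_tree/h_tree are the root nodes and that children is an empty list at the leaves):

def clear_calculate_results(node):
    # recursively drop the cached Variables so their graphs can be freed
    if hasattr(node, 'calculate_result'):
        del node.calculate_result
    for child in node.children:
        clear_calculate_results(child)

# in train_step, after self.optimizer.step():
#     clear_calculate_results(p_tree)
#     clear_calculate_results(h_tree)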


Well, I guess this is the key to addressing the problem. I'll try it immediately.

Cool!!! :+1: The problem is solved. Thank you very much!!!

I have an additional question: how can I batch tree data when training the model? Every tree has its own structure. How can I batch them under the current implementation of the recursive model?

P.S. Each epoch takes about 4.5 hours on SNLI (GPU: Titan X), which is too long.

You can't easily batch trees with this approach. You would need to use something like SPINN https://github.com/jekbradbury/examples/tree/spinn/snli/spinn.py (or, in general, batch before compute-heavy ops and unbatch afterwards).
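
A rough sketch of the batch-then-unbatch idea for one level of inner nodes (a hypothetical helper, not the method from the linked repo; it assumes all children results have shape (1, hidden_dim) and that hidden2hidden is the module defined above):

import torch

def batched_hidden2hidden(nodes, hidden2hidden):
    # stack the children results of all nodes at the same tree level
    lefts = torch.cat([n.children[0].calculate_result for n in nodes], 0)
    rights = torch.cat([n.children[1].calculate_result for n in nodes], 0)
    out = hidden2hidden(torch.cat((lefts, rights), 1))  # one big matmul for the whole level
    for i, n in enumerate(nodes):
        n.calculate_result = out[i:i + 1]               # unbatch: one row per node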


Your implementation is cool! I'll learn from it and try to batch the data in my model. Thanks very much.