CUDA error: out of memory

Hi there,

I am encountering CUDA error: out of memory while implementing a simple CNN.
The error apparently points to loss.backward(). I also call .detach() wherever I save the accuracy and loss metrics to a list. Am I missing something?

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-14-e531653ce2a2> in <module>
     20         outputs = net(items)
     21         loss = criterion(outputs,labels)
---> 22         loss.backward()
     23         optimizer.step()
     24 

~\Anaconda3\envs\pytorch\lib\site-packages\torch\tensor.py in backward(self, gradient, retain_graph, create_graph)
     91                 products. Defaults to ``False``.
     92         """
---> 93         torch.autograd.backward(self, gradient, retain_graph, create_graph)
     94 
     95     def register_hook(self, hook):

~\Anaconda3\envs\pytorch\lib\site-packages\torch\autograd\__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
     88     Variable._execution_engine.run_backward(
     89         tensors, grad_tensors, retain_graph, create_graph,
---> 90         allow_unreachable=True)  # allow_unreachable flag
     91 
     92 

RuntimeError: CUDA error: out of memory

My code is as follows:

num_epochs = 50
train_loss = []
valid_loss = []
train_accuracy = []
valid_accuracy = []


for epoch in range(num_epochs):
    iter_loss = 0
    correct = 0
    # Train the network
    net.train()
    for i,(items,labels) in enumerate(train_loader):
        items = Variable(items)
        labels = Variable(labels)
        if cuda.is_available():
            items = items.cuda()
            labels = labels.cuda()
        optimizer.zero_grad()    
        outputs = net(items)
        loss = criterion(outputs,labels)
        loss.backward()
        optimizer.step()
        
        # Book keeping for iterations
        
        iter_loss += loss
        _,predictions = torch.max(outputs,1)
        correct_pred = (predictions==labels).sum()
        correct += correct_pred
        
    # Book keeping for Epochs
    
    train_loss.append(iter_loss.detach())
    train_accuracy.append((100*correct / len(MNIST_train)).detach())
        
    # Validate 
    net.eval()
    iter_loss = 0
    correct = 0
    for i,(items,labels) in enumerate(valid_loader): 
        items = Variable(items)
        labels = Variable(labels)
        if cuda.is_available():
            items = items.cuda()
            labels = labels.cuda()
        outputs = net(items)
        loss = criterion(outputs,labels)
        
        # Book keeping for iterations
        
        iter_loss += loss
        _,predictions = torch.max(outputs,1)
        correct_pred = (predictions==labels).sum()
        correct += correct_pred
        
    # Book keeping for Epochs
    
    valid_loss.append(iter_loss.detach())
    valid_accuracy.append((100*correct / len(MNIST_valid)).detach())
    
    # Print performance metrics
    
    print('epoch: {}/{} ,train_accuracy={},valid_accuracy={}'.format(epoch+1,num_epochs,train_accuracy[-1],valid_accuracy[-1]))
          

It seems you are storing the computation graph in this line: iter_loss += loss. Use iter_loss += loss.item() instead, which accumulates a plain Python number.
Also wrap the validation loop in with torch.no_grad():, as this will save some memory by not storing the variables needed to calculate gradients.
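
The two suggestions above can be sketched roughly as follows. This is only a minimal sketch, not the original poster's code: the function names train_one_epoch and validate, the device argument, and the use of argmax in place of torch.max are assumptions made for illustration.

```python
import torch
import torch.nn as nn

def train_one_epoch(net, loader, criterion, optimizer, device):
    """Train for one pass over loader; returns (summed loss, correct count)."""
    net.train()
    iter_loss, correct = 0.0, 0
    for items, labels in loader:
        items, labels = items.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = net(items)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        # .item() returns a Python float, so no graph is kept alive.
        iter_loss += loss.item()
        correct += (outputs.argmax(dim=1) == labels).sum().item()
    return iter_loss, correct

def validate(net, loader, criterion, device):
    """Evaluate over loader without recording anything for autograd."""
    net.eval()
    iter_loss, correct = 0.0, 0
    with torch.no_grad():
        for items, labels in loader:
            items, labels = items.to(device), labels.to(device)
            outputs = net(items)
            iter_loss += criterion(outputs, labels).item()
            correct += (outputs.argmax(dim=1) == labels).sum().item()
    return iter_loss, correct
```

Because both functions return plain Python numbers, the values can be appended to the metric lists directly, without .detach().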

Doesn't net.eval() do the same?

Not really. model.eval() changes the behavior of some layers. For example, nn.BatchNorm layers will use their running stats (in the default mode) and nn.Dropout will be deactivated.
If you don't want to calculate gradients, which is the common case during evaluation, you should wrap the evaluation code in a with torch.no_grad(): block.
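
A small check of the difference described above, as a sketch on a toy model (the layer sizes and the names net, x, out_eval, and out_no_grad are arbitrary and chosen only for illustration):

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5), nn.Linear(4, 2))
x = torch.randn(3, 4)

net.eval()                          # dropout is off, but autograd still records
out_eval = net(x)
print(out_eval.requires_grad)       # True: a graph was built for this output

with torch.no_grad():               # autograd recording is disabled here
    out_no_grad = net(x)
print(out_no_grad.requires_grad)    # False: nothing was stored for backward
```

So eval() alone does not stop the graph from being built; the two are usually used together during validation.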

Thanks. That was very helpful.

I find that I get the CUDA error inconsistently. Sometimes there is no error, while the same code throws the error in another session. Is there any way to flush the GPU cache, or is there any other possible solution?

What kind of error do you get? Is your GPU running out of memory?

Did you check whether other applications or maybe dead kernels are using the GPU?
If so, you could try to kill them before running your script.
If you can't detect any other applications, did you notice a certain pattern in when the OOM occurs?
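
To look for such a pattern from inside the script, one option is to print PyTorch's own memory counters every few iterations. The helper below is a sketch; the name gpu_memory_mib is made up for illustration, and torch.cuda.memory_reserved is assumed to be available (older PyTorch releases called it torch.cuda.memory_cached). These counters only cover the current process, so memory held by other applications or dead kernels has to be checked with a tool such as nvidia-smi.

```python
import torch

def gpu_memory_mib(device=0):
    """Return (allocated, reserved) in MiB for this process, or None without CUDA."""
    if not torch.cuda.is_available():
        return None
    allocated = torch.cuda.memory_allocated(device) / 2**20  # held by live tensors
    reserved = torch.cuda.memory_reserved(device) / 2**20    # held by the caching allocator
    return allocated, reserved

# Example: call this every N iterations and watch whether "allocated" keeps growing.
print(gpu_memory_mib())
```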

I will check that. Thanks for the prompt response.

Hello Ptrblck

My CNN is shallow and the batch size is 64. My code runs once, but the next time I run it, it gives me a CUDA out of memory error.
Do you have any suggestions for what to check?

Do you see an increase in memory usage during training?
If so, you might accidentally store the computation graph, e.g. by storing loss in a Python list.
If you see the OOM error in the second epoch / iteration, you could try to wrap your training procedure into a function, since Python uses function scoping: variables local to the function can be released once it returns.

If neither of these two suggestions helps, could you post your code so that we could have a look?
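
The function-scoping suggestion could look roughly like this. It is a sketch with stand-in pieces: the nn.Linear model, the random batches, and the name train are placeholders for the real CNN and data loader, not code from this thread.

```python
import torch
import torch.nn as nn

def train(num_epochs, device):
    # Everything created here is local to this call.
    net = nn.Linear(4, 3).to(device)              # hypothetical stand-in for the CNN
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(net.parameters(), lr=0.1)
    history = []
    for _ in range(num_epochs):
        items = torch.randn(8, 4, device=device)            # stand-in batch
        labels = torch.randint(0, 3, (8,), device=device)
        optimizer.zero_grad()
        loss = criterion(net(items), labels)
        loss.backward()
        optimizer.step()
        history.append(loss.item())               # plain float, not a tensor
    return history

device = "cuda" if torch.cuda.is_available() else "cpu"
losses = train(3, device)
```

Only the list of floats leaves the function, so the model, optimizer state, and last batch are no longer referenced from the notebook's global scope after the call.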

Hi Ptrblck

I really appreciate your answer. My problem now is that I am out of quote and I have just installed Anaconda. Does bracewell have Anaconda? Can this command give me more space?

torch.cuda.empty_cache()

By “out of quote”, do you mean out of memory?
I'm not sure what bracewell is.

Emptying the cache only gives back memory that PyTorch has cached but is not currently using, so it should not avoid OOM issues and might make your code run slower.
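
A rough way to see what the call does is to compare the reserved memory before and after it. This is a sketch only; the helper name release_cached_memory is invented for illustration, and torch.cuda.memory_reserved is assumed to exist in the installed version (it was torch.cuda.memory_cached in older releases).

```python
import torch

def release_cached_memory():
    """Return (reserved_before, reserved_after) in bytes, or None without CUDA."""
    if not torch.cuda.is_available():
        return None
    x = torch.empty(1024, 1024, device="cuda")   # about 4 MiB of float32
    del x                                        # tensor is freed, its block stays cached
    reserved_before = torch.cuda.memory_reserved()
    torch.cuda.empty_cache()                     # hand unused cached blocks back to the driver
    reserved_after = torch.cuda.memory_reserved()
    return reserved_before, reserved_after

print(release_cached_memory())
```

Memory still referenced by live tensors is not touched by empty_cache(), which is why it does not help when the script itself needs more than the GPU has.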

Thanks a lot. I have requested more storage from our help service.

I appreciate your help

Cheers

Saba

Hi

Could you please recommend some practical books on PyTorch and deep learning? I have seen some tutorials but need to know more.

Cheers

Saba