Out of memory error during evaluation but training works fine!

wasiahmad · January 14, 2018, 9:19am

I recently upgraded PyTorch from 0.2 to 0.3. Surprisingly, my old programs now throw an out of memory error during evaluation (in eval() mode), while training works just fine. I am using the same batch size for training and evaluation. I have no idea what is happening. Has anyone faced a similar issue? Is there a possible solution?

ptrblck · January 14, 2018, 3:50pm

Sounds strange.
Did you use the volatile=True param on your Variables?
Is the batch size larger during eval than train?
Do you use cuDNN in both cases?

Could you post a code snippet reproducing the issue?

wasiahmad · January 15, 2018, 6:46am

I tried setting volatile=True on the variables and it didn’t help. I am using the same batch size. I am not doing anything special to use cuDNN; I am using the default settings.

def validate(self, dev_corpus):
    # Turn on evaluation mode which disables dropout.
    self.model.eval()

    dev_batches = helper.batchify(dev_corpus.data, self.config.batch_size)
    print('number of dev batches = ', len(dev_batches))

    dev_loss = 0
    num_batches = len(dev_batches)
    for batch_no in range(1, num_batches + 1):
        session_queries, session_query_length, rel_docs, rel_docs_length, doc_labels = helper.session_to_tensor(
            dev_batches[batch_no - 1], self.dictionary)
        if self.config.cuda:
            session_queries = session_queries.cuda()
            session_query_length = session_query_length.cuda()
            rel_docs = rel_docs.cuda()
            rel_docs_length = rel_docs_length.cuda()
            doc_labels = doc_labels.cuda()

        loss = self.model(session_queries, session_query_length, rel_docs, rel_docs_length, doc_labels)
        if loss.size(0) > 1:
            loss = loss.mean()
        dev_loss += loss.data[0]

    return dev_loss / num_batches

I am using the above function for evaluation. Here, session_queries, session_query_length, and the remaining variables are created with volatile=True.

I am not sure what is happening!

qoqo · May 18, 2018, 9:58am

Hi. Did you solve your problem? I am now hitting the same issue. What did you do in this situation?

ptrblck · May 18, 2018, 11:05am

The volatile flag is deprecated. In the latest stable release (0.4.0) you should use a context manager:

with torch.no_grad():
    # Your eval code

Have a look at the website for install instructions.
You can find the migration guide here.
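
As a rough sketch of what the evaluation loop from the earlier post could look like on 0.4.0 or later: the toy `nn.Linear` model, the random batches, and the `criterion` argument below are stand-ins for the real model and `helper` functions, not part of the original code.

```python
import torch
import torch.nn as nn


def validate(model, batches, criterion):
    """Return the mean loss over `batches` without recording autograd graphs."""
    model.eval()  # switches layers such as dropout to evaluation behaviour
    total_loss = 0.0
    with torch.no_grad():  # intermediate activations are not kept for backward
        for inputs, targets in batches:
            loss = criterion(model(inputs), targets)
            total_loss += loss.item()  # .item() replaces loss.data[0] from 0.3
    return total_loss / len(batches)


# Hypothetical stand-ins for the real model and data.
model = nn.Linear(10, 1)
batches = [(torch.randn(4, 10), torch.randn(4, 1)) for _ in range(3)]
dev_loss = validate(model, batches, nn.MSELoss())
print(dev_loss)
```

Note that `model.eval()` and `torch.no_grad()` do different jobs here: the first changes layer behaviour, the second stops autograd from recording the forward pass.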

qoqo · May 18, 2018, 12:42pm

Hi ptrblck, thank you for your help. It works now.

DuaneNielsen · August 25, 2018, 12:56am

Hit the same problem and the same solution worked (PyTorch 0.4.1). Seems like there might be something weird going on with the eval mode memory management.

InnovArul · August 25, 2018, 1:11am

A relevant clear-cut answer on ‘model.eval()’ vs ‘with torch.no_grad()’ from @albanD:

DuaneNielsen · August 26, 2018, 6:36pm

Thanks Arul, that’s helpful…

It still doesn’t explain why eval mode appears to use more memory than training mode, though. In theory it should use the same amount.

two_four · January 17, 2019, 7:40am

I am a novice and have the same problem during inference. I had never hit an ‘out of memory’ error before, even without torch.no_grad() or volatile=True, but this time inference does not seem to work without torch.no_grad(). pytorch 3.0.0.

jia_lee · March 9, 2019, 2:23pm

+1, why does evaluation use more memory than training?

ptrblck · March 9, 2019, 2:39pm

You might run out of memory if you still hold references to some tensors from your training iteration.
Since Python uses function scoping, these variables are still kept alive, which might result in your OOM issue. To avoid this, you could wrap your training and validation code in separate functions. Have a look at this post for more information.
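
A minimal sketch of that suggestion, assuming a toy model and random data (the names `train_step` and `validate` are illustrative only): each phase lives in its own function and returns a plain Python float, so the tensors created inside go out of scope when the function returns.

```python
import torch
import torch.nn as nn


def train_step(model, optimizer, criterion, inputs, targets):
    """One optimisation step; returns a float so no graph outlives the call."""
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
    return loss.item()  # the loss tensor and its graph stay local to this function


def validate(model, criterion, inputs, targets):
    """Forward pass only; local tensors are released when the function returns."""
    model.eval()
    with torch.no_grad():
        return criterion(model(inputs), targets).item()


# Hypothetical toy setup.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

train_loss = train_step(model, optimizer, criterion, torch.randn(8, 10), torch.randn(8, 1))
val_loss = validate(model, criterion, torch.randn(8, 10), torch.randn(8, 1))
print(train_loss, val_loss)
```

Returning `loss.item()` rather than the loss tensor matters: keeping the tensor itself around at module level would also keep its computation graph alive.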

DuaneNielsen · March 14, 2019, 5:35pm

@ptrblck, did you mean this post?

ptrblck · March 14, 2019, 5:58pm

Yes, @colesbury explains why the memory usage might grow if some tensors weren’t deleted using function scoping.

Flo · December 20, 2019, 5:20pm

What is the effect of forgetting torch.no_grad(), besides the increased memory? Will you accumulate gradients in the validation block?

ptrblck · December 20, 2019, 5:40pm

The computation graph will be created and intermediate tensors are stored.
If you don’t call backward (which wouldn’t even be possible in a torch.no_grad() block), nothing else will change.
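
One way to see this, sketched with a small hypothetical `nn.Sequential` model: an output computed normally carries a `grad_fn` (a graph was recorded), whereas the same forward pass inside `torch.no_grad()` does not.

```python
import torch
import torch.nn as nn

# Hypothetical small model, used only for illustration.
model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 1))
model.eval()  # eval mode alone does not disable graph recording
x = torch.randn(4, 10)

# Normal forward pass: autograd records the graph.
out_tracked = model(x)
print(out_tracked.requires_grad, out_tracked.grad_fn is not None)

# Forward pass under no_grad: nothing is recorded.
with torch.no_grad():
    out_untracked = model(x)
print(out_untracked.requires_grad, out_untracked.grad_fn is None)
```

Calling `backward()` on `out_untracked` would raise a `RuntimeError`, since it has no graph attached.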

Flo · December 20, 2019, 5:48pm

Well, you would call backward in the training portion. Assuming torch.no_grad() was forgotten in validation, would you then update the net with the grads tracked in the validation portion as well as those in the training portion?

ptrblck · December 20, 2019, 5:53pm

During training a new computation graph would usually be created, as long as you don’t pass e.g. the output of your validation phase as the new input to the model during training.

import torch
from torchvision import models

model = models.resnet18(pretrained=True)

# Pseudo validation phase
x1 = torch.randn(1, 3, 224, 224)
out = model(x1)

# Pseudo training phase
x1 = torch.ones(1, 3, 224, 224)
out = model(x1)
out.mean().backward()

In this code snippet you have “forgotten” to use torch.no_grad() during the validation phase.
However, since out is not used, it won’t have any effect on the gradients, but will just use unnecessary memory.

Flo · December 20, 2019, 6:01pm

OK cool, what if it’s set up this way?

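import torch.nn as nn
import torch.optim as optim
from torchvision import models

crit = nn.SomeLoss()  # placeholder for an actual loss class
net = models.resnet18()
# Named `optimizer` so it does not shadow the `optim` module; lr is a placeholder value.
optimizer = optim.SGD(net.parameters(), lr=0.01)

for e in range(num_epochs):

    # training
    pred = net(some_data)
    optimizer.zero_grad()
    loss = crit(pred, target)
    loss.backward()
    optimizer.step()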

    # validation
    valid_pred = net(some_validation_data)
    loss = crit(valid_pred, valid_target)

Would zero_grad take care of that?

ptrblck · December 20, 2019, 6:03pm

As long as you don’t calculate gradients via a backward call, no gradients will be accumulated.
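
A small sketch to check this, assuming a toy `nn.Linear` network and random data in place of the real setup: the `.grad` buffers left by the training step are compared before and after a validation forward pass that omits `torch.no_grad()`.

```python
import torch
import torch.nn as nn

net = nn.Linear(10, 1)  # hypothetical stand-in for the real network
crit = nn.MSELoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)

# Training step.
optimizer.zero_grad()
loss = crit(net(torch.randn(8, 10)), torch.randn(8, 1))
loss.backward()
optimizer.step()

# Snapshot the gradients left over from the training step.
grads_before = [p.grad.clone() for p in net.parameters()]

# Validation without torch.no_grad() and without backward().
valid_loss = crit(net(torch.randn(8, 10)), torch.randn(8, 1))

# .grad is only written by backward(), so the buffers are unchanged.
unchanged = all(torch.equal(g, p.grad) for g, p in zip(grads_before, net.parameters()))
print(unchanged)
```

The validation pass still records a graph (`valid_loss.grad_fn` is set), which is where the extra memory goes, but the stored gradients are untouched.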