I recently upgraded PyTorch from 0.2 to 0.3. Surprisingly, my old programs now throw an out-of-memory error during evaluation (in eval() mode), while training works just fine. I am using the same batch size for training and evaluation. I have no idea what is happening. Has anyone faced a similar issue? Is there a possible solution?
Did you use the volatile=True param on your Variables?
Is the batch size larger during eval than train?
Do you use cuDNN in both cases?
Could you post a code snippet reproducing the issue?
I tried using volatile=True param on the variables and it didn’t help. I am using the same batch size. I am not doing anything special to use cuDNN. I am using the default setting.
def validate(self, dev_corpus):
    # Turn on evaluation mode which disables dropout.
    self.model.eval()

    dev_batches = helper.batchify(dev_corpus.data, self.config.batch_size)
    print('number of dev batches = ', len(dev_batches))

    dev_loss = 0
    num_batches = len(dev_batches)
    for batch_no in range(1, num_batches + 1):
        session_queries, session_query_length, rel_docs, rel_docs_length, doc_labels = helper.session_to_tensor(
            dev_batches[batch_no - 1], self.dictionary)
        if self.config.cuda:
            session_queries = session_queries.cuda()
            session_query_length = session_query_length.cuda()
            rel_docs = rel_docs.cuda()
            rel_docs_length = rel_docs_length.cuda()
            doc_labels = doc_labels.cuda()

        loss = self.model(session_queries, session_query_length, rel_docs, rel_docs_length, doc_labels)
        if loss.size(0) > 1:
            loss = loss.mean()
        dev_loss += loss.data

    return dev_loss / num_batches
I am using the above function for evaluation. Here, session_queries, session_query_length, and the rest of the variables are created with volatile=True.
I am not sure what is happening!
Hi. Did you manage to solve your problem? I am running into the same problem now. What did you do in this situation?
The volatile flag is deprecated. In the latest stable release (0.4.0) you should use a context manager:
with torch.no_grad():
    # Your eval code
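For anyone landing here later, this is roughly what the context manager looks like around a whole evaluation loop. It is only a minimal sketch: the evaluate function, the toy nn.Linear model and the random batches are placeholders made up for illustration, not the code from the original post.

```python
import torch
import torch.nn as nn


def evaluate(model, batches, loss_fn):
    """Return the average loss over `batches` without building autograd graphs."""
    model.eval()  # switch dropout / batch norm layers to inference behaviour
    total = 0.0
    with torch.no_grad():  # no graph is recorded inside this block
        for inputs, targets in batches:
            loss = loss_fn(model(inputs), targets)
            total += loss.item()  # accumulate a Python float, not a tensor
    return total / len(batches)


# Hypothetical toy setup, CPU only.
model = nn.Linear(4, 1)
batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(3)]
avg_loss = evaluate(model, batches, nn.MSELoss())
print('average dev loss =', avg_loss)
```

Accumulating loss.item() instead of the loss tensor keeps the running total as a plain Python number, so nothing from the forward pass stays referenced between batches.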
Hi ptrblck, thank you for your help. It works now.
Hit the same problem, same solution worked, PyTorch 0.4.1. Seems like there might be something weird going on with the eval mode memory management.
A relevant clear-cut answer on ‘model.eval()’ vs ‘with torch.no_grad()’ from @albanD:
Thanks Arul, That’s helpful…
That still doesn’t explain why eval mode appears to use more memory than training mode, though. In theory it should use the same amount.
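For what it's worth, the difference between model.eval() and torch.no_grad() mentioned above is easy to check on a small model. The snippet below is just an illustrative sketch with a made-up toy network: it shows that eval() on its own still lets autograd record the forward pass, while no_grad() turns the recording off.

```python
import torch
import torch.nn as nn

# Hypothetical toy model with a dropout layer, CPU only.
model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5), nn.Linear(4, 1))
x = torch.randn(2, 4)

# eval() only changes layer behaviour (dropout is disabled here);
# the output is still attached to an autograd graph.
model.eval()
out_eval = model(x)
tracks_grad_in_eval = out_eval.requires_grad

# no_grad() disables graph recording, so intermediate buffers
# needed for backward are not kept.
with torch.no_grad():
    out_no_grad = model(x)
tracks_grad_in_no_grad = out_no_grad.requires_grad

print(tracks_grad_in_eval, tracks_grad_in_no_grad)
```

So calling eval() alone does not reduce the memory used by the forward pass; the two are usually used together during validation.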
I am a novice and have the same problem, but during inference. I had never hit an ‘out of memory’ error before, even without using volatile=True. This time, though, it does not seem to work without torch.no_grad(). pytorch 3.0.0.
+1, why use more memory than training?
You might run out of memory if you still hold references to some tensors from your training iteration.
Since Python uses function scoping, these variables are still kept alive, which might result in your OOM issue. To avoid this, you could wrap your training and validation code in separate functions. Have a look at this post for more information.
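A rough sketch of that suggestion, assuming a toy model and random data (train_one_epoch, validate and the batches here are all hypothetical names for illustration): each phase lives in its own function, so local tensors such as the last training loss go out of scope when the function returns.

```python
import torch
import torch.nn as nn


def train_one_epoch(model, optimizer, loss_fn, batches):
    model.train()
    for inputs, targets in batches:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
    # `loss` and the graph it references are local to this function,
    # so they can be freed once the function returns.


def validate(model, loss_fn, batches):
    model.eval()
    total = 0.0
    with torch.no_grad():
        for inputs, targets in batches:
            total += loss_fn(model(inputs), targets).item()
    return total / len(batches)


# Hypothetical toy setup, CPU only.
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
train_batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(5)]
dev_batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(2)]

train_one_epoch(model, optimizer, loss_fn, train_batches)
dev_loss = validate(model, loss_fn, dev_batches)
print('dev loss =', dev_loss)
```

If the training loop were written at module level instead, the last loss and output tensors would stay referenced while validation runs, which is the situation described above.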