Libtorch using up memory regardless of NoGradGuard

Mikele · January 26, 2020, 6:24pm

Hello pytorch community,

(Libtorch c++ 1.4 version on mac and linux):
I am currently facing the issue in my code, that iterative calls from my monte carlo tree search (building alphazero structure) to a neural network’s forward function accumulates massive amounts of memory although I have put NoGradGuard there.
The forward method in question looks like this:

std::tuple< torch::Tensor, double > NetworkWrapper::evaluate(
   const torch::Tensor &board_tensor)
{
   // We dont want gradient updates here, so we need the NoGradGuard
   torch::NoGradGuard no_grad;
   m_network->eval();

   auto [pi_tensor, v_tensor] = m_network->forward(board_tensor.to(GLOBAL_DEVICE::get_device()));

   return std::make_tuple(pi_tensor.detach(), v_tensor.template item< double >());
}

This forward method will evaluate the tensor using a convolutional net and a fully connected one.
The board_tensor also has requires_grad set to false. There is also no gradient on the tensor after the forward method, so I am unsure what all this extra memory is used for.

The tree search which calls this evaluate method all the time will eat through 32 gb memory in no time. However if I put into my main:

torch::autograd::AutoGradMode guard(false);

then there is absolutely no excessive memory usage anymore.

What exactly does AutoGradMode control, that could lead to the memory excess?

ezyang · January 27, 2020, 6:52pm

It’s hard to see without seeing your whole code, but I suspect the memory usage is not in evaluate and actually occurring somewhere else (and doesn’t have anything to do with which guard you are using.) You said, “if I put into my main”, did you mean actually in your main() function, or in evaluate()? Do you seem the same result when you replace AutoGradMode guard(false) with NoGradGuard?

Mikele · January 28, 2020, 9:46am

Yes the codebase is relatively big and impossible to post to a forum unfortuantely.
I suspect you may be right though. Replacing the global (yes in main() function) autogradmode(false) with a grad guard removes the memory accumulation as well. Do you have an idea how to best track down a seemingly unknown call to torch?

Edit:
I figured out where the hidden call to torch actually comes from. A clear fault on my end. Thank you for your help!