CPU inference causes OOM with repeated calls to forward

I have some libtorch code doing inference on the CPU with a model that was trained in PyTorch and exported to TorchScript. The code below is a simplified version of a method that is called repeatedly.

m_model is the TorchScript module loaded from the .ts file via:

    m_model = torch::jit::load(path);
    m_model.eval();

With every call it seems that more of the torch graph is allocated and never freed, so the program eventually OOMs and crashes. Commenting out the forward call causes the memory usage to stabilize.

My understanding is that the c10::InferenceMode guard should prevent autograd from recording the graph, which seems to be the usual cause of this kind of memory buildup.

    void Backend::perform(std::vector<float *> in_buffer,
                          std::vector<float *> out_buffer) {
      c10::InferenceMode guard;

      // in_buffer is unused in this simplified version; the model just gets zeros
      at::Tensor tensor_in = torch::zeros({ 1, 16, 2 });
      std::vector<torch::jit::IValue> inputs = { tensor_in };

      // calling the exported "decode" method; this is where
      // the memory growth happens
      at::Tensor tensor_out = m_model.get_method("decode")(inputs).toTensor();

      // keep the contiguous tensor alive while reading from its data pointer
      at::Tensor flat_out = tensor_out.contiguous();
      auto out_ptr = flat_out.data_ptr<float>();

      // n_vec is a class member: the number of floats copied into each output buffer
      for (size_t i = 0; i < out_buffer.size(); i++) {
        memcpy(out_buffer[i], out_ptr + i * n_vec, n_vec * sizeof(float));
      }
    }
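
For context, perform() is driven by a loop roughly like the sketch below (not the real host code: the channel count, block size, iteration count, and the assumption that Backend's constructor loads and eval()s the TorchScript file are all placeholders). Memory climbs a little on every iteration and only stabilizes if the forward call above is commented out.

    #include <cstddef>
    #include <vector>

    // Sketch of the calling pattern. Backend is the class above; its
    // constructor is assumed to call torch::jit::load() and eval().
    int main() {
      constexpr size_t n_channels = 2;  // placeholder channel count
      constexpr size_t n_vec = 4096;    // placeholder for the real n_vec member

      std::vector<std::vector<float>> in(n_channels, std::vector<float>(n_vec, 0.f));
      std::vector<std::vector<float>> out(n_channels, std::vector<float>(n_vec, 0.f));

      std::vector<float *> in_ptrs, out_ptrs;
      for (auto &v : in)  in_ptrs.push_back(v.data());
      for (auto &v : out) out_ptrs.push_back(v.data());

      Backend backend;  // assumption: loads the .ts file as shown earlier
      for (int i = 0; i < 100000; ++i) {
        backend.perform(in_ptrs, out_ptrs);  // memory grows steadily across calls
      }
      return 0;
    }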

I tried mimicking this in PyTorch (by repeatedly calling forward in a loop), and there are no memory issues.
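
Back on the libtorch side, this is the kind of standalone check I would use to confirm the guard is actually doing what I think it does (a sketch, separate from the real code; I'm assuming at::Tensor::is_inference() is available in 1.11). It just verifies that tensors created under c10::InferenceMode are inference tensors and that grad mode is off:

    #include <torch/torch.h>
    #include <iostream>

    int main() {
      {
        c10::InferenceMode guard;
        auto t = torch::zeros({ 1, 16, 2 });
        // Inside the guard: inference mode on, grad mode off, no autograd metadata.
        std::cout << "inference mode: " << c10::InferenceMode::is_enabled() << "\n";  // 1
        std::cout << "grad enabled:   " << c10::GradMode::is_enabled() << "\n";       // 0
        std::cout << "is_inference:   " << t.is_inference() << "\n";                  // 1
        std::cout << "requires_grad:  " << t.requires_grad() << "\n";                 // 0
      }
      // Outside the guard, grad mode is back on by default.
      std::cout << "grad enabled:   " << c10::GradMode::is_enabled() << "\n";         // 1
      return 0;
    }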

My system:
OS: Windows 10/11
PyTorch version: 1.11.0
libtorch version: 1.11.0