How to free GPU memory of at::Tensor in ATen, C++?

I would like to run a network in C++ by building tensors and operations with ATen on the GPU, but the GPU memory of tensors does not seem to be freed automatically. Is there a garbage collector or something similar in ATen? My platform is Windows 10, CUDA 8.0, cuDNN 7, PyTorch 0.4.0.

I found that ATen automatically releases a tensor's memory when its reference count drops to 0, as long as the tensor resides in CPU memory. In the code below, the tensor's CPU memory is freed at (3*).

/* example 1 */
// cpu memory check (1*)
{
  auto tensor = at::CPU(at::kFloat).ones({1000*1000*400});
  // cpu memory check (2*)
}
// cpu memory check (3*)

However, a CUDA tensor is not released at (3**) in example 2; its memory is released only when the program ends. I checked GPU memory at every (**) point with the MSI Afterburner monitoring program.

/* example 2 */
// cuda memory check (1**)
{
  auto tensor = at::CUDA(at::kFloat).ones({1000*1000*400});
  // cuda memory check (2**)
}
// cuda memory check (3**)

I found that a CUDA tensor from ATen can be freed as follows, although I do not know whether this is the intended or recommended way to release memory. However, it is hardly practical to call cudaFree on every single tensor returned by an ATen function such as at::conv2d.

auto tensor = at::CUDA(at::kFloat).ones({1000*1000*400});
cudaFree(tensor.storage()->data()); // ok. gpu memory is freed
out = at::conv2d(out, ...); // how to delete this intermediate tensor without (auto out2 = at::conv2d(out);)?
out = at::conv2d(out, ...);
out = at::conv2d(out, ...);

This is because of the caching allocator. I am not sure what the equivalent is in C++, but the Python way is to call torch.cuda.empty_cache().
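
To make the caching behaviour concrete, here is a toy model in plain C++. It is only a sketch of the general idea, not PyTorch's actual allocator: std::malloc and std::free stand in for cudaMalloc and cudaFree, and every name in it (ToyCachingAllocator, ToyStorage, make_storage, toy_demo) is made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <memory>
#include <new>

// Toy stand-in for a caching device allocator. This is NOT PyTorch's code:
// std::malloc / std::free play the role of cudaMalloc / cudaFree.
class ToyCachingAllocator {
 public:
  ToyCachingAllocator() = default;
  ToyCachingAllocator(const ToyCachingAllocator&) = delete;
  ToyCachingAllocator& operator=(const ToyCachingAllocator&) = delete;
  ~ToyCachingAllocator() { empty_cache(); }

  // Hand out a block, reusing a cached block of the same size if one exists.
  void* allocate(std::size_t bytes) {
    void* p = nullptr;
    auto it = cached_.find(bytes);
    if (it != cached_.end()) {
      p = it->second;
      cached_.erase(it);
    } else {
      p = std::malloc(bytes);  // the only place "device" memory is requested
      if (p == nullptr) throw std::bad_alloc();
      reserved_ += bytes;
    }
    in_use_[p] = bytes;
    allocated_ += bytes;
    return p;
  }

  // Called when the last reference to a buffer goes away.
  // The block moves into the cache; it is not returned to the "device".
  void release(void* p) {
    auto it = in_use_.find(p);
    assert(it != in_use_.end() && "block was not handed out by this allocator");
    allocated_ -= it->second;
    cached_.emplace(it->second, p);
    in_use_.erase(it);
  }

  // Toy analogue of emptying the cache: give every unused block back.
  void empty_cache() {
    for (auto& entry : cached_) {
      std::free(entry.second);
      reserved_ -= entry.first;
    }
    cached_.clear();
  }

  std::size_t allocated_bytes() const { return allocated_; }  // held by live buffers
  std::size_t reserved_bytes() const { return reserved_; }    // what an external monitor sees

 private:
  std::multimap<std::size_t, void*> cached_;
  std::map<void*, std::size_t> in_use_;
  std::size_t allocated_ = 0;
  std::size_t reserved_ = 0;
};

// Reference-counted buffer standing in for a tensor's storage.
using ToyStorage = std::shared_ptr<void>;

ToyStorage make_storage(ToyCachingAllocator& alloc, std::size_t bytes) {
  return ToyStorage(alloc.allocate(bytes),
                    [&alloc](void* p) { alloc.release(p); });
}

// Mirrors example 2: the storage dies at the closing brace,
// but the allocator keeps the block until the cache is emptied.
void toy_demo() {
  ToyCachingAllocator alloc;
  {
    ToyStorage s = make_storage(alloc, 1024);
    assert(alloc.allocated_bytes() == 1024);
  }
  assert(alloc.allocated_bytes() == 0);
  assert(alloc.reserved_bytes() == 1024);  // still reserved, as the monitor shows
  alloc.empty_cache();
  assert(alloc.reserved_bytes() == 0);
}
```

In this model, reassigning a variable (as in out = at::conv2d(out, ...)) drops the old buffer's last reference, so its block goes back into the cache and can be reused by the next allocation of the same size. Nothing is leaked; the memory is just not handed back until the cache is emptied.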


Given that people link this thread and appear to be looking for it:

As Simon says, when a Tensor (or all Tensors referring to a memory block (a Storage)) goes out of scope, the memory goes back to the cache PyTorch keeps. You can free the memory from the cache using

#include <c10/cuda/CUDACachingAllocator.h>

and then calling

c10::cuda::CUDACachingAllocator::emptyCache();

(of course, you could try using torch:: instead of c10:: and see if it is automatically imported somewhere).
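
The scope-then-empty-cache pattern described in this thread can be sketched roughly as follows. This assumes a CUDA-enabled libtorch recent enough to ship <c10/cuda/CUDACachingAllocator.h>, and it uses torch::ones with TensorOptions rather than the 0.4-era at::CUDA(at::kFloat) from the question. The function name allocate_then_empty_cache and the SKETCH_HAVE_LIBTORCH_CUDA macro are made up, and the __has_include guard is only there so the snippet still compiles on a machine without libtorch.

```cpp
#if defined(__has_include)
#if __has_include(<torch/torch.h>) && __has_include(<c10/cuda/CUDACachingAllocator.h>)
#include <torch/torch.h>
#include <c10/cuda/CUDACachingAllocator.h>
#define SKETCH_HAVE_LIBTORCH_CUDA 1
#endif
#endif
#ifndef SKETCH_HAVE_LIBTORCH_CUDA
#define SKETCH_HAVE_LIBTORCH_CUDA 0
#endif

// Hypothetical helper: returns true if a CUDA tensor was created and the
// cache was emptied afterwards, false if libtorch/CUDA is not usable here.
bool allocate_then_empty_cache() {
#if SKETCH_HAVE_LIBTORCH_CUDA
  if (!torch::cuda::is_available()) {
    return false;
  }
  {
    // Roughly 400 MB of floats on the GPU.
    auto tensor = torch::ones(
        {1000 * 1000 * 100},
        torch::TensorOptions().device(torch::kCUDA).dtype(torch::kFloat));
  }  // Last reference dropped: the block goes back to PyTorch's cache,
     // so an external monitor still shows the memory as used.

  // Hand the unused cached blocks back to the CUDA driver.
  c10::cuda::CUDACachingAllocator::emptyCache();
  return true;
#else
  return false;
#endif
}
```

Only blocks that no live tensor refers to are released, so tensors still in scope are unaffected. Calling cudaFree directly on a tensor's pointer, as in the question, bypasses the allocator's bookkeeping, which is why going through the allocator is the safer route.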

Best regards