Hi, I’m currently using the C++ interface to load JIT-compiled models and it works like a charm.
However, there isn’t much control over memory usage. I would like to do two things:
- Firstly, I need to clear the cache after running a model. The closest thing I found in the torch code were calls to c10::cuda::CUDACachingAllocator::emptyCache(); would this indeed be the right function to call? It doesn’t seem to be exposed publicly, but I could probably find a way to change the code to call it.
- Secondly, what would be extremely useful is a minimal-memory-usage mode where intermediate activations are released as soon as possible. When you are only interested in the output, you don’t need the activations, and this would really help when using Torch in production!
It would work like the NoGrad guard, but go further: basically a NoGradNoState guard. Is there already a way to achieve this? Or can someone perhaps outline how one would implement it in the code?
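For the first point, here is roughly what I have in mind; this is only a sketch, assuming the c10/cuda/CUDACachingAllocator.h header is available from my build of libtorch, and "model.pt" stands in for my actual model path:

```cpp
#include <torch/script.h>
#include <c10/cuda/CUDACachingAllocator.h>

int main() {
  {
    // Load and run the JIT-compiled model on the GPU.
    torch::jit::script::Module module = torch::jit::load("model.pt");
    module.to(torch::kCUDA);
    auto input = torch::randn({1, 3, 224, 224}, torch::kCUDA);
    auto output = module.forward({input}).toTensor();
  }  // module, input, and output go out of scope here

  // Return cached, currently-unused blocks to the CUDA driver.
  // My understanding is that this only frees blocks not backing
  // live tensors, which is why the tensors are scoped above.
  c10::cuda::CUDACachingAllocator::emptyCache();
  return 0;
}
```

If emptyCache() is not meant to be called from user code, I’d appreciate a pointer to whatever the supported way is.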
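For the second point, the closest I can get today is the NoGrad guard, which at least stops the autograd graph (and hence the saved activations needed for backward) from being built. A sketch of how I currently run inference:

```cpp
#include <torch/script.h>

// Run a forward pass without recording autograd state.
torch::Tensor run_inference(torch::jit::script::Module& module,
                            const torch::Tensor& input) {
  torch::NoGradGuard no_grad;  // RAII guard: autograd disabled in this scope
  return module.forward({input}).toTensor();
}
```

What I’m asking about would go beyond this: a hypothetical NoGradNoState guard that also lets the runtime free each intermediate activation as soon as the last op consuming it has run, rather than keeping them alive until the whole forward pass returns.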