Memory used by a JIT module with a CUDA model cannot be released in libtorch

I find that memory usage is very high when I load a CUDA model. After some experiments, I found that if I load a traced model with torch::jit::load, its memory is not actually released if the model is on CUDA.

Here is my test code:

#include <torch/script.h>
#include <iostream>
#include <memory>

using namespace std;

class model {
public:
    torch::jit::script::Module module;
    void load() {
        module = torch::jit::load("/home/gino/");  // a CUDA model; I also prepared a CPU one
    }
};

int main() {
    {
        // stage 1: initialization (model not loaded yet)
        cout << "initialization ..." << endl;
        unique_ptr<model> myModel;

        // stage 2: load a model
        cout << "load ..." << endl;
        myModel = make_unique<model>();
        myModel->load();

        // stage 3: release the unique_ptr
        myModel.reset();
        cout << "try reset ... " << endl;
    }

    // stage 4: outside the scope of myModel
    cout << "try outside ... " << endl;

    return 0;
}
The test code has four stages: 1. start the program and do nothing; 2. load a model with torch::jit::load; 3. reset the unique_ptr that holds the JIT module; 4. leave the scope (I assume the pointer is destroyed automatically there, so even if I handled the pointer incorrectly, it should still be released at this point).

I use this Linux command to check memory usage:

free -m

and I watch the “available” value to see whether the memory has been freed.
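To take the same reading from inside the program rather than running free by hand at each stage, something like the following might work. It is only a sketch and assumes Linux, where the “available” column of free comes from the MemAvailable field of /proc/meminfo. The helper names parse_mem_available_mib and mem_available_mib are made up for this example.

```cpp
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: returns MemAvailable in MiB parsed from a
// /proc/meminfo-style stream, or -1 if the field is missing.
long parse_mem_available_mib(std::istream& in) {
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string key;
        long kib = 0;
        // Lines look like: "MemAvailable:    6809600 kB"
        if (fields >> key >> kib && key == "MemAvailable:") {
            return kib / 1024;  // the kernel reports KiB; free -m shows MiB
        }
    }
    return -1;
}

// Reads the live value on Linux; returns -1 if /proc/meminfo cannot be read.
long mem_available_mib() {
    std::ifstream meminfo("/proc/meminfo");
    return parse_mem_available_mib(meminfo);
}
```

Calling mem_available_mib() right after each stage's cout line would give one reading per stage without switching terminals.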

I used the JIT to trace EasyOCR’s text detection model, then saved both a CPU and a CUDA version. However, the model itself doesn’t matter for this test.

Here are the test results:

  1. available memory with the CPU model (MiB, from free -m)
    stage 0 (before running the program): 6650
    stage 1 (run the program and do nothing): 6593
    stage 2 (load the model): 6515
    stage 3 (release the outer class): 6590
    stage 4 (before the program ends): 6594

As you can see, the memory is released successfully when I use the CPU model. However, when I load the CUDA model, the usage is huge and the behavior is strange.

  2. available memory with the CUDA model (MiB, from free -m)
    stage 0 (before running the program): 6545
    stage 1 (run the program and do nothing): 6459
    stage 2 (load the model): 5344
    stage 3 (release the outer class): 5342
    stage 4 (before the program ends): 5340
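To make the difference between the two runs explicit, the readings can be turned into per-stage deltas. This is just a small arithmetic sketch over the numbers listed above. The names Readings, Summary and summarize are invented for this example and are not part of libtorch.

```cpp
#include <array>

// "available" readings in MiB from free -m, indexed by stage 0..4.
using Readings = std::array<long, 5>;

struct Summary {
    long used_by_load;    // drop in available memory from stage 1 to stage 2
    long freed_by_reset;  // rise in available memory from stage 2 to stage 3
    long still_held;      // memory not returned by stage 4, relative to stage 1
};

Summary summarize(const Readings& r) {
    return Summary{r[1] - r[2], r[3] - r[2], r[1] - r[4]};
}

// Readings copied from the two result lists above.
const Readings cpu_run  = {6650, 6593, 6515, 6590, 6594};
const Readings cuda_run = {6545, 6459, 5344, 5342, 5340};
```

For the CPU run, summarize(cpu_run) gives 78 MiB used by the load and 75 MiB returned by the reset. For the CUDA run, summarize(cuda_run) gives 1115 MiB used by the load and nothing returned (the reading drops by another 2 MiB), so 1119 MiB is still held at stage 4.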

As you can see, even after I reset the outer class, the memory is not released. The CUDA model's footprint is too large to ignore, so I’m looking for a way to release it once the model has finished its job. I found some discussions saying that “cudaDeviceReset()” resets everything in CUDA and frees the memory. However, I’m curious whether the usage in RAM can also be released that way.

So, how do I release the torch::jit::Module correctly?

  • Are we talking about CPU RAM or GPU RAM?
  • Upon first use of the GPU, PyTorch will initialize CUDA. This part uses quite a bit of memory (both GPU and CPU) that won’t be released until PyTorch exits. You could separate this out in your analysis by creating a small CUDA tensor before loading any model.
  • For the “user-allocated CUDA RAM”, calling c10::cuda::CUDACachingAllocator::emptyCache() is probably a good idea; the caching allocator’s statistics might also give you more insight.

Best regards