Release ALL CUDA GPU MEMORY using Libtorch C++

Hi,
I want to know how to release ALL of the CUDA GPU memory used by a Libtorch module (torch::nn::Module). I created a new class A that inherits from Module; this class has other registered modules inside. I cannot even release a basic module instance such as nn::Conv2d.

To start, I will ask about a simple case: how to release a single instance of nn::Conv2d whose memory lives on a CUDA GPU. Here is an example:

int gpu_id = 0;
auto deviceCUDA = torch::Device(torch::kCUDA, gpu_id);
int c_in = 1000;
int c_out = 1000;
auto conv1 = nn::Conv2d(nn::Conv2dOptions(c_in, c_out, torch::ExpandingArray<2>(3))
                            .stride(torch::ExpandingArray<2>(2))
                            .padding(torch::ExpandingArray<2>(1))
                            .bias(false));

conv1->to(deviceCUDA);

// HERE: RELEASE THE CUDA GPU MEMORY HELD BY conv1

I tried using c10::cuda::CUDACachingAllocator::emptyCache(), but it is not enough: it only releases a small amount of cached memory.

Also, I would like to know if there is a recommended methodology for releasing memory held by Libtorch C++ modules, specifically CUDA GPU memory.

Thanks a lot.

You can always use cudaFree on a tensor’s raw storage address as follows, as long as you know the tensor is allocated on the GPU:
cudaFree(tensor.storage().data());
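
For example, a minimal self-contained sketch of that idea (the shape here is arbitrary; error handling omitted):

#include <torch/torch.h>
#include <cuda_runtime.h>

int main() {
    // Allocate a tensor directly on the GPU.
    auto t = torch::ones({1024, 1024}, torch::device(torch::kCUDA));

    // Release the raw device buffer behind the tensor's storage.
    // The tensor object still exists afterwards and must not be
    // used again, since its data pointer is now dangling.
    cudaFree(t.storage().data());
    return 0;
}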

nn::Module provides APIs to get every tensor under that module (and its submodules, if any). For Conv2d, I think you just need to free the weight tensor and the bias tensor within it.
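
Something along these lines should do it (a sketch on my side; recursing through submodules):

#include <torch/torch.h>
#include <cuda_runtime.h>

// Free the device buffer behind every parameter of a module.
// Sketch only: afterwards the module's tensors are dangling and
// the module must not be used again.
void free_module_params(torch::nn::Module& module) {
    for (auto& p : module.parameters(/*recurse=*/true)) {
        cudaFree(p.storage().data());
    }
}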

But the whole workflow seems odd to me: it sounds like you inherit from an nn module but don’t actually need the content within it.

Hi glaringlee,

Thanks a lot for your answer!

I was working on your suggestions, but I couldn’t figure out how to free all of the GPU memory that my program allocated.

I tried several approaches. I began by trying to release the nn::Conv2d memory. Then I tried with a plain Tensor: only part of the memory was released, and calling c10::cuda::CUDACachingAllocator::emptyCache() changed nothing.

I commented and uncommented several parts of the code trying to find the best approach. In particular, I tried to obtain all of the Tensors of the Module, but using cudaFree on each Tensor's data did not work. Perhaps my approach was not the best.

If you or someone else could share some simple code that works for a module like nn::Conv2d and/or a Tensor, that would be great!

Here is my testing code:

	int gpu_id = 0;
	auto device = torch::Device(torch::kCUDA, gpu_id);

	///// TRYING TO RELEASE A SIMPLE TENSOR ////
	
	///// GPU MEMORY			: 0.7 GB
	///// DEDICATED GPU MEMORY  : 0.6 GB
	int rows = 10000;
	int columns = 10000;
	int channels = 3;

	float* tensorDataPtr = new float[rows * columns * channels];
	auto tensorCreated = torch::from_blob(tensorDataPtr, { rows, columns, channels }, c10::TensorOptions().dtype(torch::kFloat32))/*.to(torch::kCUDA)*/;
	tensorCreated = tensorCreated.to(device);

	///// GPU MEMORY			: 2.3 GB
	///// DEDICATED GPU MEMORY  : 2.2 GB

	cudaFree(tensorCreated.data_ptr());

	///// GPU MEMORY			: 1.2 GB
	///// DEDICATED GPU MEMORY  : 1.1 GB

	c10::cuda::CUDACachingAllocator::resetAccumulatedStats(gpu_id);
	c10::cuda::CUDACachingAllocator::resetPeakStats(gpu_id);
	c10::cuda::CUDACachingAllocator::emptyCache();

	///////////////////////////////////////////

	///// GPU MEMORY			: 1.2 GB
	///// DEDICATED GPU MEMORY  : 1.1 GB

	int c_in = 1000;
	int c_out = 1000;
	auto conv2d = nn::Conv2d(nn::Conv2dOptions(c_in, c_out, torch::ExpandingArray<2>(3)).stride(torch::ExpandingArray<2>(2)).padding(torch::ExpandingArray<2>(1)).bias(false));
	conv2d->to(device);

	//cudaFree(conv2d->weight.storage().data());
	//cudaFree(conv2d->bias.storage().data());    // note: bias(false) above, so there is no bias tensor
	
	// Free the device buffer behind every parameter (for Conv2d with
	// bias(false), this is just the weight tensor).
	for (auto& tensor : conv2d->parameters(/*recurse=*/true))
	{
		cudaFree(tensor.storage().data());
	}

	// Conv2d registers no buffers, but handle them for completeness.
	for (auto& tensor : conv2d->buffers(/*recurse=*/true))
	{
		cudaFree(tensor.storage().data());
	}
	
	//// named_buffers()/named_parameters() return an OrderedDict;
	//// iterate its items and use .value() to reach each tensor.
	//for (auto& item : conv2d->named_buffers(true))
	//{
	//	cudaFree(item.value().storage().data());
	//}
	//
	//for (auto& item : conv2d->named_parameters(true))
	//{
	//	cudaFree(item.value().storage().data());
	//}
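
For completeness: is the intended pattern with the caching allocator instead to drop every owning reference and only then call emptyCache()? A sketch of what I mean (an assumption on my part, not tested yet):

	{
		// Scope the module so its parameters are destroyed at the
		// closing brace (assuming nothing else holds a reference).
		auto conv = nn::Conv2d(nn::Conv2dOptions(1000, 1000, 3)
								   .stride(2)
								   .padding(1)
								   .bias(false));
		conv->to(torch::Device(torch::kCUDA, 0));
		// ... use conv ...
	}
	// The freed blocks are still cached by the allocator; return
	// them to the driver so tools like nvidia-smi see the release.
	c10::cuda::CUDACachingAllocator::emptyCache();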

@lbdalmendrayCaseguar
Hi, sorry for the late reply. What is your libtorch version?
I tested your code with the latest libtorch.
What I got is that CUDA initialization takes 0.6-0.7 GB of memory; after creating your tensorCreated, total memory is around 1.8 GB; and after calling cudaFree(tensorCreated.data_ptr());, memory usage goes back to 0.6-0.7 GB.
If you are using an old libtorch version, it is probably a bug that has since been fixed.
Let me know.

Here is some GPU memory info:
Cuda initialized:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.126.02   Driver Version: 418.126.02   CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40           On   | 00000000:0B:00.0 Off |                    0 |
| N/A   28C    P0    69W / 250W |    639MiB / 11448MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Tensor created:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.126.02   Driver Version: 418.126.02   CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40           On   | 00000000:0B:00.0 Off |                    0 |
| N/A   26C    P0    66W / 250W |   1785MiB / 11448MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0   3810476      C   ./tests                                   1774MiB   |
+-----------------------------------------------------------------------------+

Tensor freed:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.126.02   Driver Version: 418.126.02   CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M40           On   | 00000000:0B:00.0 Off |                    0 |
| N/A   28C    P0    69W / 250W |    639MiB / 11448MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
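
If it helps, you can also query memory in-process instead of between runs of nvidia-smi; a small sketch using the CUDA runtime API:

#include <cuda_runtime.h>
#include <cstdio>

// Print free/total memory for the current CUDA device.
void print_gpu_mem(const char* tag) {
    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);
    std::printf("%s: %zu MiB free / %zu MiB total\n", tag,
                free_bytes / (1024 * 1024), total_bytes / (1024 * 1024));
}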

Hello @glaringlee,

Thank you very much for your answer!

My libtorch version is: libtorch-win-shared-with-deps-1.6.0-Cuda102.
Do you think I should update to the stable version (1.7.1)?

I will need some time to update; I hope all of my code is compatible with the new version. I’ve been doing a lot of testing with 1.6.0, so the update will take a while.

Thank you!

@lbdalmendrayCaseguar
I am not 100% sure about this, but between 1.6 and 1.7 we fixed several major CUDA memory-leak issues for sure.

Thanks @glaringlee.