How to free CPU memory after inference in libtorch?

Here’s my question: I am running inference on images on the GPU in libtorch. It occupies a large amount of CPU memory (2 GB+) when I run the following code:

output = net.forward({ imageTensor }).toTensor();

Until the end of the main function, the CPU memory remains unfreed. I also tried running “c10::cuda::CUDACachingAllocator::emptyCache();”, but nothing happened.
What can I do to free the CPU memory?
@tom @albanD

Two things:

  • be sure to have autograd disabled,
  • put output in a local C++ scope so it goes out of scope.
    Also, the network parameters would still be around, right?

That should help.

Best regards


Thank you for your attention!
A simplified version of my code is as follows:

bool segMain(vtkImageData* vtkImageDataPixel, vtkImageData* outImageData, std::string modelPath)
{
	auto pixelDataShape = vtkImageDataPixel->GetDimensions();
	auto spacing = vtkImageDataPixel->GetSpacing();
	int imageDims[3];
	short* oriPixelPointer = static_cast<short*>(vtkImageDataPixel->GetScalarPointer());   //get data pointer

	auto device = torch::kCUDA;     // set cuda
	torch::jit::script::Module module; // load model
	module = torch::jit::load(modelPath);


	at::Tensor imageTensor = torch::from_blob(oriPixelPointer, { imageDims[2], imageDims[0], imageDims[1] }, torch::kShort).to(torch::kFloat32);
	imageTensor = torch::flip(imageTensor, { 0,1,2 });
	auto imageTensorNorm = (imageTensor + 1024.0);
	at::Tensor modelOutputTensor = torch::ones({ imageDims[2], imageDims[0], imageDims[1] }, torch::kFloat32);
	at::Tensor imageTensorSlicer, modelOutputTensorSlicer;

	for (int i = 0; i < imageDims[2]; i++)
	{
		imageTensorSlicer = imageTensorNorm[i].unsqueeze(0).unsqueeze(0);
		imageTensorSlicer =;     // predict
		modelOutputTensorSlicer = module.forward({ imageTensorSlicer }).toTensor();
		modelOutputTensorSlicer = torch::one_hot(torch::argmax(torch::softmax(modelOutputTensorSlicer, 1), 1), 2).permute({ 0, 3, 1, 2 }).detach();
		modelOutputTensor[i] = modelOutputTensorSlicer[0][1].data();
	}
	modelOutputTensor = torch::flip(modelOutputTensor, { 0,1,2 });
	modelOutputTensor =;

	modelOutputTensor =;
	vtkSmartPointer<vtkImageData> vtkModelOutputImageData = vtkImageData::New();
	vtkModelOutputImageData->SetDimensions(pixelDataShape[0], pixelDataShape[1], pixelDataShape[2]);
	vtkModelOutputImageData->SetExtent(0, pixelDataShape[0] - 1, 0, pixelDataShape[1] - 1, 0, pixelDataShape[2] - 1);
	vtkModelOutputImageData->AllocateScalars(VTK_SHORT, 1);
	vtkModelOutputImageData->SetSpacing(spacing[0], spacing[1], spacing[2]);
	short* pixelPointer = static_cast<short*>(vtkModelOutputImageData->GetScalarPointer());   //get data pointer
	memcpy(pixelPointer, modelOutputTensor.data_ptr(), sizeof(short) * pixelDataShape[0] * pixelDataShape[1] * pixelDataShape[2]);

	return true;
}


int main()
{
	std::cout << "main func start run ..." << std::endl;
	std::string dataDirPath = "D:\\sourceData.mhd";
	std::string modelPath = "C:\\";

	auto imageReader = vtkSmartPointer<vtkMetaImageReader>::New();
	vtkImageData* vtkImageData = imageReader->GetOutput();

	auto outImage = vtkImageData::New();
	auto r = segMain(vtkImageData, outImage, modelPath);



	cout << " stop " << endl;
}

I think I have done both things, but by the time execution reaches “cout << " stop " << endl;”, the CPU memory has still not been released. It also takes up the same amount of memory when I run segMain twice.
As far as I know, the network parameters and the tensors behave like C++ smart pointers. But the CPU memory remains unfreed, even after a long sleep.
Thank you for the response.

Best regards
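
The smart-pointer comparison above can be illustrated without libtorch. This is only an analogy, assuming std::shared_ptr as a stand-in for a tensor handle; the Buffer type and both function names are hypothetical. It shows that a buffer is destroyed when its last handle leaves scope, and that a handle held outside the scope keeps the buffer alive:

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for tensor storage; counts how many instances are alive.
struct Buffer {
	static int liveCount;
	std::vector<float> data;
	explicit Buffer(std::size_t n) : data(n) { ++liveCount; }
	~Buffer() { --liveCount; }
};
int Buffer::liveCount = 0;

// All handles are local to the inner block, so the buffer dies with the block.
// Returns the number of live buffers seen inside the block.
int liveInsideScope()
{
	int inside = 0;
	{
		auto a = std::make_shared<Buffer>(1024);
		auto b = a;                    // second handle to the same buffer
		inside = Buffer::liveCount;
	}   // a and b are destroyed here, which destroys the buffer
	return inside;
}

// A handle declared outside the block keeps the buffer alive after the block ends.
// Returns the number of live buffers seen after the block.
int liveAfterScopeWithOuterHandle()
{
	std::shared_ptr<Buffer> outer;
	{
		auto inner = std::make_shared<Buffer>(1024);
		outer = inner;
	}   // inner is destroyed, but outer still owns the buffer
	return Buffer::liveCount;
}
```

In this analogy, any handle that survives the function, such as a module held elsewhere or a tensor copied into an outer variable, keeps its storage alive in the same way.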