Doubt about creating a tensor from a gpu pointer

Hello, I have a question regarding the C++ implementation of a model already trained in Python. I am new to using C++ with PyTorch.

Given the trained model, I know how to load it in C++, but I would like to speed up inference. I receive the input as a pointer to GPU memory; is there any way to use this pointer as the input to the neural network without making any extra copy?

I believe that, given the tensor dimensions shape1 and shape2, a trained module and a GPU pointer p, if I do:

auto options = torch::TensorOptions().dtype(torch::kFloat32).device(torch::kCUDA);
auto tensor_from_cuda_pointer = torch::from_blob(p, {shape1, shape2}, options);
at::Tensor output = module.forward({tensor_from_cuda_pointer}).toTensor();

a tensor wrapping the existing CUDA memory will be created and inference will run. However, how can I be sure that no extra copy is made here?
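One way to check this directly is to compare the data pointers of the source buffer and the wrapped tensor: if torch::from_blob made no copy, they are identical. A minimal sketch (the shapes and variable names here are only for illustration):

#include <torch/torch.h>
#include <iostream>

int main() {
	// Allocate a source tensor on the GPU.
	auto src = torch::ones({20, 1600}, torch::device(torch::kCUDA));
	auto options = torch::TensorOptions().dtype(torch::kFloat32).device(torch::kCUDA);
	// Wrap the existing device memory without copying.
	auto wrapped = torch::from_blob(src.data_ptr<float>(), {20, 1600}, options);
	// Prints "true": both tensors point at the same device memory.
	std::cout << std::boolalpha << (wrapped.data_ptr() == src.data_ptr()) << std::endl;
	return 0;
}

Note that from_blob does not take ownership of the memory, so the original buffer must stay alive as long as the wrapping tensor is in use.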

Thanks in advance.

To test my issue I created large tensors in four different modes and measured GPU memory usage with nvidia-smi:

  • Mode 1: create tensor1 on the GPU and create tensor2 from tensor1's pointer.
  • Mode 2: create only tensor1.
  • Mode 3: create tensor1 and tensor2 from scratch on the GPU.
  • Mode 4: create tensor1 from scratch on the GPU, clone it, and send the clone to the GPU.
#include <torch/script.h> // One-stop header.
#include <windows.h>
#include <iostream>

int main(int argc, const char* argv[]) {
	int mode = 1;
	int shape1 = 20;
	int shape2 = 1600;
	int shape3 = 1600;

	if (mode == 1)
	{
		// One tensor created from scratch; a second tensor wrapping tensor1's pointer.
		torch::Tensor ones_tensor = torch::ones({shape1, 1, shape2, shape3}, torch::device(torch::kCUDA));
		auto p = ones_tensor.data_ptr<float>();
		auto options = torch::TensorOptions().dtype(torch::kFloat32).device(torch::kCUDA);
		torch::Tensor new_tensor_from_pointer = torch::from_blob(p, {shape1, 1, shape2, shape3}, options);
	} else if (mode == 2) {
		// One tensor created from scratch on the GPU.
		torch::Tensor ones_tensor = torch::ones({shape1, 1, shape2, shape3}, torch::device(torch::kCUDA));
	} else if (mode == 3) {
		// Two tensors created from scratch on the GPU.
		torch::Tensor ones_tensor = torch::ones({shape1, 1, shape2, shape3}, torch::device(torch::kCUDA));
		torch::Tensor two_tensor = torch::ones({shape1, 1, shape2, shape3}, torch::device(torch::kCUDA));
	}
	else if (mode == 4)
	{
		// One tensor created from scratch; a second tensor created by cloning it.
		torch::Tensor ones_tensor = torch::ones({shape1, 1, shape2, shape3}, torch::device(torch::kCUDA));
		auto new_tensor_cloned = ones_tensor.clone();
		// Note: to() returns a new tensor rather than moving in place; the clone
		// is already on the GPU, so this is effectively a no-op here.
		new_tensor_cloned = new_tensor_cloned.to(torch::Device(torch::kCUDA));
	}
	else
	{
		std::cout << "Doing nothing";
	}
	// Keep the process alive so GPU memory can be inspected with nvidia-smi.
	Sleep(30000);
	return 0;
}

The results are the following:

  • Mode 1: 667 MiB
  • Mode 2: 667 MiB
  • Mode 3: 863 MiB
  • Mode 4: 863 MiB

So it seems that torch::from_blob does in fact not copy the original tensor, which is what I am looking for. However, I was expecting modes 3 and 4 to use approximately double the memory of modes 1 and 2. What could be the reason for that?

Thanks in advance.

The CUDA context itself uses memory for the driver, loaded kernels, etc., so part of what nvidia-smi reports is fixed overhead rather than your tensors.
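Your numbers are consistent with that: a float32 tensor of shape {20, 1, 1600, 1600} takes 20 * 1600 * 1600 * 4 bytes ≈ 195 MiB, which matches the extra memory used by modes 3 and 4 relative to modes 1 and 2 (863 − 667 = 196 MiB). The remaining ~470 MiB seen in every mode is the fixed context overhead, paid once regardless of how many tensors you allocate, which is why the totals do not simply double.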