The fastest way to allocate temp memory on the GPU in a C++ extension

Hi, all.

I’ve run into a speed problem when allocating temp memory in a C++/CUDA extension.

I’m writing a CUDA operator that needs temp memory during GPU computation. However, after measuring the operator's speed, I found that the slowest part of the operator is the allocation of the temp memory. Moreover, allocating it in Python is much slower than allocating it in CUDA. The examples are shown below; the time is measured with time.perf_counter().

Code 1: Allocating inside the Python wrapper with zero initialization.

tmp_buffer1 = torch.zeros(4, 1, 20, 20, 20, 32).cuda()
tmp_buffer2 = torch.zeros(4, 32, 32).cuda()
tmp_buffer3 = torch.zeros(4, 20, 20, 20, 20, 32).cuda()
#######  my operator... ########

result:

time: 0.0035348859382793307

Code 2: Allocating inside the Python wrapper without initialization.

tmp_buffer1 = torch.Tensor(4, 1, 20, 20, 20, 32).cuda()
tmp_buffer2 = torch.Tensor(4, 32, 32).cuda()
tmp_buffer3 = torch.Tensor(4, 20, 20, 20, 20, 32).cuda()
#######  my operator... ########

result:

time: 0.0029048159390687943

Code 3: Allocating inside the CUDA code without initialization.

float *tmp_buffer1 = nullptr;
float *tmp_buffer2 = nullptr;
float *tmp_buffer3 = nullptr;
cudaMalloc((void**)&tmp_buffer1, 4*1*20*20*20*32*sizeof(float));
cudaMalloc((void**)&tmp_buffer2, 4*32*32*sizeof(float));
cudaMalloc((void**)&tmp_buffer3, 4*20*20*20*20*32*sizeof(float));
/////// some kernels for computing... ///////
cudaFree(tmp_buffer1);
cudaFree(tmp_buffer2);
cudaFree(tmp_buffer3);

result:

time: 0.000754036940634244

Is there anything like context->allocate_temp() from TensorFlow's C++ extension API?

What’s the fastest way of doing this?

Hi,

Which timing library are you using? Is it CUDA-aware and doing the proper synchronization (via torch.cuda.synchronize(), for example)?
Also, your code creates the Tensors on the CPU and then moves them to the GPU; you can avoid this by allocating directly on the device: torch.zeros(4, 1, 20, 20, 20, 32, device="cuda"). A timing sketch is below.
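
For reference, a minimal timing sketch that synchronizes around the allocation and creates the buffer directly on the GPU (the shape is borrowed from the buffers above):

import time
import torch

torch.cuda.synchronize()  # finish any pending GPU work before starting the clock
start = time.perf_counter()
tmp_buffer1 = torch.zeros(4, 1, 20, 20, 20, 32, device="cuda")  # allocate directly on the GPU
torch.cuda.synchronize()  # wait for the asynchronous allocation/fill to complete
print("time:", time.perf_counter() - start)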

The problem with cudaMalloc/cudaFree is that they need to access the GPU and so can be fairly slow. Allocating a Tensor should be faster because we use a custom caching allocator; see the sketch below.
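
A minimal sketch of the effect (my own illustration, not from the thread): when a CUDA tensor is freed, the caching allocator keeps the block around instead of calling cudaFree, so a second allocation of the same size skips cudaMalloc entirely:

import time
import torch

shape = (4, 20, 20, 20, 20, 32)

t = torch.empty(shape, device="cuda")  # first allocation may call cudaMalloc
del t                                  # block is returned to the cache, not cudaFree'd

torch.cuda.synchronize()
start = time.perf_counter()
t = torch.empty(shape, device="cuda")  # served from the cache, no cudaMalloc
torch.cuda.synchronize()
print("cached allocation time:", time.perf_counter() - start)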

Thanks. The problem turned out to be that I did not use device="cuda" but used .cuda() instead, which costs a lot of time.
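
For completeness, a sketch of the corrected wrapper with the same buffer shapes as above, allocating directly on the device:

tmp_buffer1 = torch.zeros(4, 1, 20, 20, 20, 32, device="cuda")
tmp_buffer2 = torch.zeros(4, 32, 32, device="cuda")
tmp_buffer3 = torch.zeros(4, 20, 20, 20, 20, 32, device="cuda")
#######  my operator... ########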