Hi all.
I've run into a speed problem when allocating temporary memory in a C++/CUDA extension.
I'm writing a CUDA operator that needs temporary buffers during GPU computation. After profiling the operator, I found that the slowest part is the allocation of those temp buffers. Moreover, allocating them in Python is much slower than allocating them in CUDA. The examples are shown below; times were measured with time.perf_counter().
Code 1: Allocating inside the Python wrapper with zero initialization.
tmp_buffer1 = torch.zeros(4, 1, 20, 20, 20, 32).cuda()
tmp_buffer2 = torch.zeros(4, 32, 32).cuda()
tmp_buffer3 = torch.zeros(4, 20, 20, 20, 20, 32).cuda()
####### my operator... ########
result:
time: 0.0035348859382793307
Code 2: Allocating inside the Python wrapper without initialization.
tmp_buffer1 = torch.Tensor(4, 1, 20, 20, 20, 32).cuda()
tmp_buffer2 = torch.Tensor(4, 32, 32).cuda()
tmp_buffer3 = torch.Tensor(4, 20, 20, 20, 20, 32).cuda()
####### my operator... ########
result:
time: 0.0029048159390687943
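One thing I noticed about Codes 1 and 2 (please correct me if I'm wrong): both of them build the tensor on the CPU first and then copy it to the GPU with .cuda(), so part of the measured time is a host-to-device transfer rather than the allocation itself. Allocating directly on the device with torch.empty(..., device='cuda') should skip the copy and go through PyTorch's caching allocator. A minimal sketch of what I mean (the CPU fallback is only there so the snippet runs anywhere):

```python
import torch

# Sketch: allocate the temp buffers directly on the target device instead of
# creating them on the CPU and copying with .cuda().
device = "cuda" if torch.cuda.is_available() else "cpu"  # fallback for illustration

# torch.empty skips zero-initialization; the memory contents are undefined.
tmp_buffer1 = torch.empty(4, 1, 20, 20, 20, 32, device=device)
tmp_buffer2 = torch.empty(4, 32, 32, device=device)
tmp_buffer3 = torch.empty(4, 20, 20, 20, 20, 32, device=device)
```

I have not benchmarked this variant against the numbers above, so I don't know how close it gets to the raw cudaMalloc timing.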
Code 3: Allocating inside the CUDA code without initialization.
float *tmp_buffer1 = nullptr;
float *tmp_buffer2 = nullptr;
float *tmp_buffer3 = nullptr;
cudaMalloc((void**)&tmp_buffer1, 4*1*20*20*20*32*sizeof(float));
cudaMalloc((void**)&tmp_buffer2, 4*32*32*sizeof(float));
cudaMalloc((void**)&tmp_buffer3, 4*20*20*20*20*32*sizeof(float));
/////// some kernels for computing... ///////
cudaFree(tmp_buffer1);
cudaFree(tmp_buffer2);
cudaFree(tmp_buffer3);
result:
time: 0.000754036940634244
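For reference, here is a sketch of what I think the libtorch-side alternative would look like: allocating the temp buffers with at::empty inside the extension, which (as I understand it) goes through PyTorch's CUDA caching allocator instead of hitting cudaMalloc/cudaFree on every call. This is an untested assumption on my part, not something I have benchmarked; the function name my_operator is just a placeholder.

```cpp
#include <torch/extension.h>

// Hypothetical sketch: allocate temp buffers through PyTorch's caching
// allocator instead of raw cudaMalloc (assumes a libtorch C++ extension).
void my_operator(const at::Tensor& input) {
  // Inherit the CUDA device from the input tensor; force float dtype.
  auto opts = input.options().dtype(at::kFloat);

  // at::empty does no zero-initialization; repeated calls should reuse
  // cached blocks rather than calling cudaMalloc each time.
  at::Tensor tmp_buffer1 = at::empty({4, 1, 20, 20, 20, 32}, opts);
  at::Tensor tmp_buffer2 = at::empty({4, 32, 32}, opts);
  at::Tensor tmp_buffer3 = at::empty({4, 20, 20, 20, 20, 32}, opts);

  // ... launch kernels on tmp_bufferN.data_ptr<float>() ...

  // The buffers are returned to the allocator's cache automatically when
  // the tensors go out of scope; no explicit cudaFree is needed.
}
```

If this is the intended pattern, it would also avoid the implicit device synchronization that cudaMalloc/cudaFree can cause.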
Is there anything in the PyTorch C++ extension API like context->allocate_temp() in TensorFlow's C++ op API?
What's the fastest way of doing this?