Tensor.sum() requiring a ton of CPU time

Hello all!

I have a torch.cuda.FloatTensor of torch.Size([32000]). My end goal is to compute the variance of that tensor's contents, but profiling showed that var() was taking a long time, so I wrote my own variance() function to see where the inefficiency was. Running torch.utils.bottleneck reveals (bottleneck results) that sum() is the culprit; my guess is that it is transferring the full tensor to the CPU to run. (I was able to replicate the slowdown almost exactly by transferring the tensor to the CPU first and then running my function.)
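For reference, my variance() is essentially the textbook two-pass formula, something like this (a simplified sketch, not the exact script):

```python
import torch

def variance(t):
    # Two-pass variance: compute the mean, then the mean squared deviation.
    # The sum() calls here are where the profiler attributes the time.
    n = t.numel()
    mean = t.sum() / n
    return ((t - mean) ** 2).sum() / (n - 1)

x = torch.randn(32000).cuda()
print(variance(x))
```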

Is that intended behavior? I am far from a CUDA expert, but it seems like cuBLAS has summation functions, and I would think it would be possible to do this entirely on the GPU. Or have I misdiagnosed the problem and am I doing something else wrong?

Thanks in advance for any pointers y’all can provide!

Environment details:
Ubuntu 16.04
conda python 3.6.5
pytorch 0.4.0
CUDA 9.0

How did you time your function? Could you provide the script? CUDA kernels are launched asynchronously, so the profiling results may be showing some unwanted copying, or the cost of waiting for earlier queued kernels, rather than the real execution time of sum().
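For a fair measurement you have to call torch.cuda.synchronize() before reading the clock; otherwise the time of previously queued work gets attributed to whichever call happens to block. A quick sketch of how I would time it (sizes and iteration counts are just placeholders):

```python
import time
import torch

x = torch.randn(32000, device='cuda')

# Warm-up: the first CUDA calls pay one-time initialization costs.
for _ in range(10):
    x.sum()
torch.cuda.synchronize()

start = time.time()
for _ in range(100):
    s = x.sum()
torch.cuda.synchronize()  # wait for all queued kernels to finish
elapsed = time.time() - start
print('sum(): {:.6f} ms per call'.format(elapsed / 100 * 1000))
```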