Hello all!
I have a `torch.cuda.FloatTensor` of `torch.Size([32000])`. My end goal is to compute the variance of the contents of that tensor, but after profiling, `var()` was taking a long time, so I wrote my own `variance()` function to see where the inefficiency was. Running `torch.utils.bottleneck` reveals (bottleneck results) that `sum()` is the culprit — my guess is that it is transferring the full tensor to the CPU to run. (I was able to replicate the slowdown precisely by moving the tensor to the CPU first and then running my function.)
Is that intended behavior? I am far from a CUDA expert, but it seems like cuBLAS has summation functions, and I would think it would be possible to do this all on the GPU. Or have I misdiagnosed the problem and am I doing something else wrong?
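In case it helps anyone reproduce the measurement, here is roughly the timing harness I used (helper name is my own; note that CUDA kernels launch asynchronously, so `torch.cuda.synchronize()` is needed before reading the clock for the timing to reflect the kernel itself rather than just the launch):

```python
import time
import torch

def time_sum(t, iters=100):
    # Wait for any pending GPU work before starting the clock.
    if t.is_cuda:
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        s = t.sum()
    # Wait for the launched kernels to finish before stopping it.
    if t.is_cuda:
        torch.cuda.synchronize()
    return (time.time() - start) / iters

x = torch.randn(32000)
print("cpu sum():", time_sum(x))
if torch.cuda.is_available():
    print("gpu sum():", time_sum(x.cuda()))
```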
Thanks in advance for any pointers y’all can provide!
Environment details:
Ubuntu 16.04
conda Python 3.6.5
PyTorch 0.4.0
CUDA 9.0