Optimize transfer speed from CPU to GPU for simple computations

I have parameterized a filtering problem so that it essentially reduces to an elementwise product followed by a sum over two arrays. I want to do this on the GPU, using PyTorch to accelerate it.

Right now my code is:

kernels_cuda = torch.cuda.FloatTensor(kernels)            # (32, 32, 159, 21, 21, 21) float32 array
sub_matrices_cuda = torch.cuda.FloatTensor(sub_matrices)  # (32, 32, 159, 21, 21, 21) float32 array
out = (sub_matrices_cuda * kernels_cuda).sum(dim=[3, 4, 5]).cpu().numpy()
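For reference, here is a minimal, self-contained way to reproduce the measurement, with random data standing in for my real arrays and smaller shapes so it runs anywhere (on GPU, `torch.cuda.synchronize()` is needed so the timer captures the actual transfer and compute, not just the asynchronous kernel launches):

```python
import time
import numpy as np
import torch

# Smaller stand-in shapes so this runs on any machine;
# my real arrays are (32, 32, 159, 21, 21, 21) float32.
shape = (4, 4, 8, 21, 21, 21)
kernels = np.random.rand(*shape).astype(np.float32)
sub_matrices = np.random.rand(*shape).astype(np.float32)

device = "cuda" if torch.cuda.is_available() else "cpu"

start = time.perf_counter()
kernels_t = torch.from_numpy(kernels).to(device)
sub_matrices_t = torch.from_numpy(sub_matrices).to(device)
out = (sub_matrices_t * kernels_t).sum(dim=[3, 4, 5]).cpu().numpy()
if device == "cuda":
    # Block until all queued GPU work has finished before stopping the clock.
    torch.cuda.synchronize()
elapsed = time.perf_counter() - start

print(out.shape)  # (4, 4, 8)
```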

This takes about 3.0 s, and almost all of that time is spent sending the data to the GPU. Is there any way I can speed this up?

My hardware is a Titan RTX, an Intel Xeon W-2133, and 2666 MHz DDR4 RAM. I have the feeling that this hardware should be capable of more?