I have parameterized a filtering problem so that it essentially reduces to an elementwise product and a sum between two arrays. I want to run this on the GPU, using PyTorch to accelerate it.
Right now my code is:
kernels_cuda = torch.cuda.FloatTensor(kernels)            # (32,32,159,21,21,21) float32 array
sub_matrices_cuda = torch.cuda.FloatTensor(sub_matrices)  # (32,32,159,21,21,21) float32 array
out = (sub_matrices_cuda*kernels_cuda).sum(dim=[3,4,5]).cpu().numpy()
torch.cuda.synchronize()
This takes about 3.0 s, and almost all of that time is spent sending the data to the GPU. Is there any way I can speed up my code?
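For reference, this is roughly how I measure the time (a minimal sketch with reduced array sizes so it runs anywhere, including on CPU-only machines; the real arrays have shape (32, 32, 159, 21, 21, 21)):

```python
import time
import numpy as np
import torch

# Reduced shapes for illustration; the real arrays are
# (32, 32, 159, 21, 21, 21) float32, roughly 6 GB each.
shape = (4, 4, 8, 5, 5, 5)
kernels = np.random.rand(*shape).astype(np.float32)
sub_matrices = np.random.rand(*shape).astype(np.float32)

# Fall back to CPU so the sketch is runnable without a GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

start = time.perf_counter()
kernels_t = torch.from_numpy(kernels).to(device)
sub_matrices_t = torch.from_numpy(sub_matrices).to(device)
out = (sub_matrices_t * kernels_t).sum(dim=[3, 4, 5]).cpu().numpy()
if device == "cuda":
    torch.cuda.synchronize()  # make sure all GPU work is finished
elapsed = time.perf_counter() - start

print(out.shape)  # first three dims survive: (4, 4, 8)
print(f"{elapsed:.3f} s")
```

The `.cpu()` call already forces a synchronization before the timer stops, so the explicit `synchronize()` mainly guards against any remaining queued work.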
My hardware is a Titan RTX, an Intel Xeon W-2133, and 2666 MHz DDR4 RAM. I have the feeling that I should be able to do better than this on this hardware?