.to(device) slowing the code

s_n · May 13, 2020, 6:18am

Hi,
Whenever I call .to(device) on any tensor in my code, it takes too much time and slows my code down. I can't understand why this is happening.

AlbanOdot · May 13, 2020, 10:46am

Hello,

Your tensors might be huge and the bandwidth to your GPU too low, which would slow down the transfer.

You can reduce your batch size and accumulate the gradient (by not calling optimizer.zero_grad() at every backward() call). This will most likely ease the bandwidth bottleneck.

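If you reduce your batch size by a factor of, say, 4, you can use:

if counter % 4 == 0:
    optimizer.zero_grad()  # clear gradients at the start of each group of 4
# have fun with your loss here (compute it and call loss.backward())
if counter % 4 == 3:
    optimizer.step()       # update once all 4 small batches have accumulated
counter += 1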
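For anyone who wants to run the accumulation pattern end to end, here is a minimal sketch. The tiny nn.Linear model, the random data, and the names accum_steps and num_updates are made up for illustration. Dividing the loss by accum_steps is my own assumption (it keeps the gradient scale comparable to one full-size batch) and is not something the post above prescribes.

```python
import torch
from torch import nn

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical toy model and optimizer, only here to make the loop runnable.
model = nn.Linear(8, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

accum_steps = 4   # assumed factor by which the batch size was reduced
num_updates = 0   # counts optimizer steps, for checking the schedule

# 8 small batches of 4 samples stand in for 2 original batches of 16.
batches = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(8)]

optimizer.zero_grad()
for counter, (x, y) in enumerate(batches):
    x, y = x.to(device), y.to(device)
    # Assumption: scale the loss so the summed gradients match the
    # mean over the equivalent full-size batch.
    loss = loss_fn(model(x), y) / accum_steps
    loss.backward()  # gradients add up in each parameter's .grad
    if (counter + 1) % accum_steps == 0:
        optimizer.step()       # one update per accum_steps small batches
        optimizer.zero_grad()  # start the next group from zero
        num_updates += 1

print(f"optimizer steps taken: {num_updates}")
```

With 8 small batches and accum_steps = 4, the optimizer steps twice, once after each group of four backward() calls.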

albanD · May 13, 2020, 4:25pm

Note as well that the .to() calls are the only blocking operations in the CUDA API, which is otherwise completely asynchronous.
So if you time your code without using torch.cuda.synchronize() properly, it is expected that the .to() calls are the only ones that appear to take a significant amount of time! But this is only because they wait for the GPU to finish processing all other operations before returning.
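To make the timing point concrete, here is a rough sketch of how one might time a GPU operation and a .to() copy separately. The helper names sync and timed and the tensor sizes are invented for this example. It falls back to the CPU when no GPU is available, in which case the synchronization is skipped and the numbers say nothing about transfer cost.

```python
import time
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def sync():
    # torch.cuda.synchronize() blocks until all queued GPU work is done.
    # Skipped on CPU, where operations already run synchronously.
    if device.type == "cuda":
        torch.cuda.synchronize()

def timed(fn):
    # Hypothetical helper: drain pending work, run fn, wait for it to finish.
    sync()
    start = time.perf_counter()
    result = fn()
    sync()
    return result, time.perf_counter() - start

a = torch.randn(1024, 1024, device=device)
host = torch.randn(1024, 1024)  # lives in CPU memory

# Without the sync() calls, the matmul would only be queued and its cost
# could show up in whichever later call has to wait for it.
_, matmul_s = timed(lambda: a @ a)
moved, copy_s = timed(lambda: host.to(device))

print(f"matmul: {matmul_s * 1e3:.3f} ms, .to(device): {copy_s * 1e3:.3f} ms")
```

The synchronize call before starting the clock matters as much as the one after it: otherwise earlier queued kernels get billed to whatever is being measured.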

s_n · May 13, 2020, 9:12pm

Actually, I was using .to(device) in different code (not for calculating the loss), where I need to create a torch variable, copy my inputs into it, and do some operations on it.

s_n · May 13, 2020, 9:13pm

Even with torch.cuda.synchronize() I get the same timing.
That is, .to(device) still takes too much time in my profiling results.