PyTorch tensor.to(device) too slow?

I’m seeing a very slow .to(device) transfer for a single batch. If I understand correctly, the dataloader should be sampled from in the main training loop, and only then (once the whole batch is gathered) should the batch be transferred to the GPU with the tensor’s .to(device) method?

My batch is 32 samples x 64 features x 1000 length x 4 bytes (float32) / (1024*1024) ≈ 7.8 MB, and I’m using a 1080 Ti graphics card. A tensor.to(device) of roughly 8 MB should be pretty much instantaneous, right? Or am I missing something in my calculation?
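Just to double-check the arithmetic:

# one batch of float32 values, shape (32, 64, 1000), 4 bytes per element
batch_bytes = 32 * 64 * 1000 * 4
print(batch_bytes / (1024 * 1024))   # ≈ 7.8 MiB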

The bottleneck profiler output is available here: https://pastebin.com/dvQe6Y2Q (copy the raw data into gedit/notepad for easier viewing).

Here is the code that was profiled:

it = iter(dataloader)
for i in range(5):
    time_s = time.time()
    sample = next(it)
    optimizer.zero_grad()
    data = sample[0].to(device)
    iz = model(data)
    ctc_loss = nn.CTCLoss(zero_infinity=True)
    loss = ctc_loss(iz.transpose(0, 1).transpose(0, 2), sample[1], sample[2], sample[3])
    loss.backward()
    optimizer.step()
    time_e = time.time()
    print("Time ", time_e - time_s)

Hi,

The CUDA API is asynchronous for most operations, so a profiler that does not do proper synchronization won’t give you accurate information.
In particular, a non-async copy to or from the GPU forces a synchronization and therefore waits for all outstanding tasks to finish.
So this is expected. You can try adding a torch.cuda.synchronize() just before this line, and all the time will then be spent in that call instead of in the copy.
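For example, something along these lines (a minimal sketch, with a random tensor standing in for your batch) isolates the cost of the copy itself:

import time
import torch

device = torch.device("cuda")
data_cpu = torch.randn(32, 64, 1000)   # stand-in for one batch

torch.cuda.synchronize()               # wait for any previously queued GPU work
t0 = time.time()
data_gpu = data_cpu.to(device)
torch.cuda.synchronize()               # make sure the copy itself has finished
t1 = time.time()
print("copy time:", t1 - t0)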


@albanD Thanks for the response! Could you please elaborate? I’ve seen elsewhere that PyTorch just asynchronously queues operations for the GPU. If that’s the case, what does torch.cuda.synchronize() do exactly (and how do I speed things up with an async copy)? When I measure with time.time() and call torch.cuda.synchronize() before t1 and t2, it says the step is very fast.

EDIT: Should an async copy make it faster? I have tried pin_memory=True on the dataloader and .to(device, non_blocking=True); if an async copy is the way to go, is that all there is to it?
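Concretely, what I mean is roughly this (just a sketch, with a random TensorDataset standing in for my real data):

import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda")

# stand-in dataset with the shapes from the post (batches of 32 x 64 x 1000)
dataset = TensorDataset(torch.randn(320, 64, 1000))

loader = DataLoader(dataset, batch_size=32, num_workers=4,
                    pin_memory=True)   # batches come out in page-locked host memory

for (data_cpu,) in loader:
    # non_blocking=True only gives an asynchronous copy when the source is pinned
    data = data_cpu.to(device, non_blocking=True)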

The CUDA API is asynchronous, so yes, it queues tasks for the GPU to execute. torch.cuda.synchronize() simply waits for all outstanding tasks to finish before returning.

An async copy will indeed be asynchronous. Whether or not it ends up faster depends on your workload/code, but it won’t make a huge difference for sure.
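If you want the copy to actually overlap with compute, one pattern (just a sketch, reusing the names from your loop above and assuming pin_memory=True on the dataloader; loss/label handling omitted) is to queue the next batch’s copy before running the model on the current batch:

it = iter(dataloader)
next_data = next(it)[0].to(device, non_blocking=True)   # copy is queued, not waited on

for i in range(5):
    data = next_data
    try:
        # queue the following batch's copy before doing any compute
        next_data = next(it)[0].to(device, non_blocking=True)
    except StopIteration:
        next_data = None
    optimizer.zero_grad()
    iz = model(data)   # runs while the next copy is (potentially) in flight
    # ... compute the CTC loss, loss.backward(), optimizer.step() as before
    if next_data is None:
        break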