Time cost on data load and training

I have done some experiments about the time cost on data load and model training.

  1. Load data with torch.utils.data.DataLoader and train model.
    Time cost on data load: 14.0 ms
    Time cost on training (data load + model training): 1.845 ms

  2. Almost all of the loading time is spent on data.to(device), so I tried another way to load data. I read data into a queue in another process, then collected that data into batches, executed data.to(device) in a separate thread, and pushed the results to a queue. In the main thread, I access the data with q.get().

Time cost on data load: < 0.1 ms
Time cost on training (data load + model training): 1.885 ms
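For reference, the background-thread prefetch pattern described above can be sketched with Python's standard library alone. The batching and .to(device) step is replaced here by a placeholder prepare callable, since the actual loading code is not shown; the names prefetcher and prepare are illustrative, not from the original setup:

```python
import queue
import threading

def prefetcher(source, prepare, maxsize=4):
    """Run `prepare` (standing in for batching + .to(device)) in a
    background thread and hand finished items to the main thread
    through a bounded queue."""
    q = queue.Queue(maxsize=maxsize)
    sentinel = object()

    def worker():
        for item in source:
            q.put(prepare(item))  # heavy work happens off the main thread
        q.put(sentinel)           # signal end of stream

    threading.Thread(target=worker, daemon=True).start()

    while True:
        item = q.get()  # main thread: near-instant once the queue is warm
        if item is sentinel:
            break
        yield item

# usage: the main thread only pays for q.get(), as in method 2
batches = list(prefetcher(range(5), prepare=lambda x: x * 2))
print(batches)  # → [0, 2, 4, 6, 8]
```

The bounded maxsize keeps the producer from racing arbitrarily far ahead of the consumer, which also bounds host/device memory held by prepared batches.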

My question is: why does method 2 cost more time on training?
I guess it is because the new thread slows down the main thread, but torch.utils.data.DataLoader also loads data with multiple workers.

Could you post the code you’ve used to benchmark the data loading?
How large is each data tensor you push to the device?
Based on your description it seems the to() call is quite expensive and I would like to make sure you are indeed timing the right operations.
Note that CUDA operations are asynchronous, so you would need to synchronize via torch.cuda.synchronize() before starting and stopping the timer.
Otherwise some operations might create a synchronization point and thus accumulate the execution time of the previously launched async calls.
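CUDA specifics aside, the pitfall can be illustrated with any asynchronous API: timing the launch of work is not the same as timing its completion. A stdlib analogy (a thread pool standing in for the CUDA stream, future.result() standing in for torch.cuda.synchronize()):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def work():
    time.sleep(0.2)  # stands in for a GPU kernel

with ThreadPoolExecutor() as pool:
    t0 = time.perf_counter()
    fut = pool.submit(work)                   # "launch" returns immediately
    launch_time = time.perf_counter() - t0    # looks almost free

    fut.result()                              # block until the work finishes,
    total_time = time.perf_counter() - t0     # like synchronizing before
                                              # stopping the timer

print(launch_time < total_time)  # → True
print(total_time >= 0.2)         # → True (includes the real execution time)
```

Without the blocking call, the timer only measures how long it takes to enqueue the work; any later call that happens to block then gets charged for all the pending work, which can make an unrelated line look slow.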

data size: [16, 3, 473, 473]

In method 1:

t0 = time.time()
batch = next(trainloader)  # trainloader is an iterator over the DataLoader
images, labels, _, _ = batch
images = images.to(device)
labels = labels.long().to(device)

t.append(time.time() - t0)
if i_iter % 20 == 0:
    print('batch read time: ', sum(t[-20:]) / 20)

###
    ....
    model forward and backward
    ....
###
t1.append(time.time() - t0)
if i_iter % 20 == 0:
    print('model train time: ', sum(t1[-20:]) / 20)

In method 2:

t0 = time.time()
batch = q.get() # .to() has been executed in other thread before
images, labels, _, _ = batch

t.append(time.time() - t0)
if i_iter % 20 == 0:
    print('batch read time: ', sum(t[-20:]) / 20)

###
    ....
    model forward and backward
    ....
###
t1.append(time.time() - t0)
if i_iter % 20 == 0:
    print('model train time: ', sum(t1[-20:]) / 20)

I have not used torch.cuda.synchronize() before starting and stopping the timer.