Data-loader time is higher than network-time

I am puzzled by the timing info from the following simple loop with a data loader. The observation is that tdata is more than half of titer. Is there any kind of synchronization happening in the data loader that waits for the backward pass to finish?

print("iter: ", “batch_idx”, “tdata”, “tfwd”, “tloss”, “tbwd”, “titer”, “titercheck”, “args.batch_size/titer”, “args.batch_size/(titer-tdata)”, “time.time()” )
extralong = []
time00 = time.time()
for batch_idx, (data, target) in enumerate(train_loader):
data = data.to(args.device)
target = target.to(args.device)
if args.nhwc:
data = data.to(memory_format=torch.channels_last)
tdata = time.time() - time00
optimizer.zero_grad()
output = model(data)
tfwd = time.time() - tdata - time00
loss = criterion(output, target)
tloss = time.time() - tdata - tfwd - time00
loss.backward()
optimizer.step()
tend = time.time()
tbwd = tend - tdata - tfwd - tloss - time00
titer = tdata + tfwd + tloss + tbwd
titercheck = tend - time00
time00 = time.time()
if True:
print("iter: ", batch_idx, tdata, tfwd, tloss, tbwd, titer, args.batch_size/titer, args.batch_size/(titer-tdata), len(extralong), time.time() )
print("iter: ", batch_idx, tdata, tfwd, tloss, tbwd, titer, args.batch_size/titer, args.batch_size/(titer-tdata), len(extralong), time.time() )

I assume you are using the GPU for the actual model training. If so, note that synchronizations are missing and need to be added before starting and stopping the timers, since CUDA kernels are executed asynchronously. Without synchronizing, a timer can accumulate the time of whatever previously queued work happens to synchronize inside its section; for example, the data.to(args.device) copy at the top of the loop can block until the backward pass and optimizer step of the previous iteration have finished, which would inflate tdata.
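
As a concrete illustration, here is a minimal sketch of the same loop with torch.cuda.synchronize() added before each timer read, so each section is charged only for the work it launched. It assumes model, criterion, optimizer, train_loader and args are defined as in the snippet above and that args.device is a CUDA device:

```python
import time
import torch

# Sketch only: same loop as above, with explicit synchronization
# before every timer read. Assumes model, criterion, optimizer,
# train_loader and args exist as in the original snippet and that
# args.device is a CUDA device.
time00 = time.time()
for batch_idx, (data, target) in enumerate(train_loader):
    data = data.to(args.device)
    target = target.to(args.device)
    if args.nhwc:
        data = data.to(memory_format=torch.channels_last)
    torch.cuda.synchronize()        # wait for the host-to-device copies
    tdata = time.time() - time00

    optimizer.zero_grad()
    output = model(data)
    torch.cuda.synchronize()        # wait for the forward kernels
    tfwd = time.time() - tdata - time00

    loss = criterion(output, target)
    torch.cuda.synchronize()        # wait for the loss kernels
    tloss = time.time() - tdata - tfwd - time00

    loss.backward()
    optimizer.step()
    torch.cuda.synchronize()        # wait for backward + optimizer kernels
    tend = time.time()
    tbwd = tend - tdata - tfwd - tloss - time00

    titer = tend - time00
    print("iter: ", batch_idx, tdata, tfwd, tloss, tbwd, titer,
          args.batch_size/titer, args.batch_size/(titer-tdata))
    time00 = time.time()
```

With the synchronizations in place, tdata should drop to the true data-loading and copy time, and the time that was previously hiding in the queued backward/optimizer kernels should show up in tbwd instead. Note that the extra synchronize calls themselves add overhead, so use this loop for profiling only.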