Overlapping CPU and GPU workloads

I am trying to speed up data loading. Here is a common flow I have found:

from torch.utils.data import DataLoader

# some code

loader = DataLoader(your_dataset, ..., pin_memory=True)
data_iter = iter(loader)

next_batch = next(data_iter)  # start loading the first batch
next_batch = [t.cuda(non_blocking=True) for t in next_batch]  # with pin_memory=True and non_blocking=True, this copies the data to the GPU asynchronously

for i in range(len(loader)):
    batch = next_batch
    if i + 1 < len(loader):
        # start copying the next batch while the GPU works on the current one
        next_batch = [t.cuda(non_blocking=True) for t in next(data_iter)]
    # ... forward/backward pass on `batch` here ...
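One way I can imagine tidying this up is wrapping the bookkeeping in a small generator (a sketch, not tested against a real training loop; the GPU copy is abstracted into a `transfer` callback so the pattern runs even without CUDA):

```python
def prefetch(iterable, transfer=lambda x: x):
    """Yield items from `iterable`, applying `transfer` (e.g. a non-blocking
    host-to-device copy) to the next item before yielding the current one,
    so an asynchronous copy can overlap with the consumer's work."""
    it = iter(iterable)
    try:
        next_item = transfer(next(it))  # kick off the first transfer
    except StopIteration:
        return
    for item in it:
        # start transferring the upcoming item, then hand out the current one
        current, next_item = next_item, transfer(item)
        yield current
    yield next_item  # last item, nothing left to prefetch
```

With a `DataLoader` the transfer callback would be something like `lambda batch: [t.cuda(non_blocking=True) for t in batch]`, and the training loop becomes `for batch in prefetch(loader, to_gpu): ...`. Whether this is actually any more elegant than the explicit loop, I'm not sure.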

Is there a better way of accomplishing this? It just seems like it could be done more elegantly.