Well, if you don't have any synchronization points in your training loop (e.g. pushing the model output back to the CPU) and you use pin_memory=True in your DataLoader together with non_blocking=True in the to() calls, the data transfer should be overlapped by the kernel execution:
for data, target in loader:
    # Overlapping transfer if pinned memory is used
    data = data.to('cuda:0', non_blocking=True)
    target = target.to('cuda:0', non_blocking=True)

    # Reset the gradients accumulated in the previous iteration
    optimizer.zero_grad()

    # The following calls are asynchronous: each kernel is launched and
    # control returns to the CPU thread before the kernel has actually
    # begun executing
    output = model(data)  # kernels wait (on the stream) until data is on the device
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
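For reference, here is a minimal sketch of the DataLoader setup the snippet assumes (the dataset, shapes, batch size, and num_workers are just placeholder values):

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; any map-style dataset works the same way
dataset = TensorDataset(torch.randn(1000, 3, 32, 32),
                        torch.randint(0, 10, (1000,)))

# pin_memory=True puts each batch into page-locked (pinned) host memory,
# which is what allows the non_blocking=True copies above to be truly asynchronous
loader = DataLoader(dataset, batch_size=32, num_workers=2, pin_memory=True)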
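If you are unsure whether a hidden sync point is eating the overlap, note that any op which needs the actual tensor values on the host will block the CPU thread. A small self-contained sketch of common sync points (the tensor and its shape are arbitrary):

import torch

x = torch.randn(1024, 1024, device='cuda')
y = x @ x  # kernel is launched asynchronously; control returns to the CPU immediately

# Each of the following forces the CPU thread to wait for the GPU,
# which would break the transfer/compute overlap if executed every iteration:
val = y.sum().item()      # device-to-host copy of a scalar (sync point)
y_cpu = y.cpu()           # device-to-host copy of the full tensor (sync point)
torch.cuda.synchronize()  # explicit synchronization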
Here is some more in-depth information from the NVIDIA devblog.