Should we set non_blocking to True?

Well, if you don’t have any synchronization points in your training loop (e.g. pushing the model output to the CPU) and you use pin_memory=True in your DataLoader, the data transfer should be overlapped with the kernel execution:

for data, target in loader:
    # Overlapping transfer if pinned memory
    data = data.to('cuda:0', non_blocking=True)
    target = target.to('cuda:0', non_blocking=True)

    # The following calls are asynchronous: the kernels are launched
    # and control returns to the CPU thread before they have actually
    # begun executing on the GPU
    output = model(data)  # runs only after the copy has finished (same CUDA stream), but doesn't block the CPU
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
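
For completeness, the pinned memory part refers to how the DataLoader is constructed. A minimal sketch (the dataset, batch size and worker count here are just placeholders):

from torch.utils.data import DataLoader

# pin_memory=True makes the DataLoader return batches in pinned (page-locked)
# host memory, which is what allows the .to(..., non_blocking=True) copies
# above to actually run asynchronously
loader = DataLoader(
    dataset,            # placeholder: your Dataset instance
    batch_size=64,      # placeholder value
    shuffle=True,
    num_workers=4,      # placeholder value
    pin_memory=True,
)

Note that anything which needs the result on the CPU inside the loop, e.g. print(loss) or loss.item(), creates a synchronization point and will remove the overlap.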

Here is some more in-depth information from the NVIDIA devblog.
