Why set cuda(non_blocking=True) for target variables?

In this example: https://github.com/pytorch/examples/blob/master/imagenet/main.py (and in some other places), the non_blocking argument of cuda(device=None, non_blocking=False) is set to True for the target variable only:

    target = target.cuda(non_blocking=True)

Why do they do this? And why is non_blocking set to True only for the target variable, but not for the input variable? In the same example, both the train and validation data loaders set pin_memory to True. (I don't understand the pin_memory setting in torch.utils.data.DataLoader either.)

train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=args.batch_size, shuffle=(train_sampler is None),
    num_workers=args.workers, pin_memory=True, sampler=train_sampler)

val_loader = torch.utils.data.DataLoader(
    datasets.ImageFolder(valdir, transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        normalize,
    ])),
    batch_size=args.batch_size, shuffle=False,
    num_workers=args.workers, pin_memory=True)

Non-blocking transfers let you overlap GPU compute with host-to-device memory copies. The reason you can make the target transfer non-blocking is that the target is not needed until the loss computation, so its copy can overlap the model's forward pass. Making the input transfer non-blocking as well would yield no benefit, because the model has an immediate dependency on the input data.
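
A minimal sketch of the pattern in a training loop (the names model, criterion, and train_loader follow the linked example; this assumes the loader was built with pin_memory=True):

    for input, target in train_loader:
        # the model needs input right away, so a non-blocking copy buys nothing here
        input = input.cuda()
        # target is only needed at the loss, so its copy can overlap the forward pass
        target = target.cuda(non_blocking=True)
        output = model(input)             # forward pass runs while target is copying
        loss = criterion(output, target)  # stream ordering guarantees the copy is done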

Pinned (page-locked) memory is what allows the non-blocking calls to actually be asynchronous: a .cuda(non_blocking=True) copy only returns before the transfer completes if the source CPU tensor is in pinned memory, which is exactly what pin_memory=True in the DataLoader arranges for each batch.
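
A standalone illustration of this, outside the DataLoader (hypothetical tensor shape; assumes a CUDA device is available):

    import torch

    # a CPU tensor must live in pinned (page-locked) memory for the copy to be async
    x = torch.randn(64, 3, 224, 224).pin_memory()
    assert x.is_pinned()

    y = x.cuda(non_blocking=True)  # returns immediately; copy runs asynchronously
    torch.cuda.synchronize()       # block until the copy (and any queued work) is done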


Hey, thanks for the thorough explanation.

Could you please clarify what you mean by "overlap compute and memory transfer to the GPU"? I understand what "memory transfer" refers to here (the asynchronous copy), but what does the "compute" part refer to?

Does that mean that the model's kernels will start executing on the input data while the target is still being transferred to GPU memory?