Should we set non_blocking to True?

I’ve read the documentation, which says that if we have pinned memory, we can set non_blocking to True. Could this cause any problems in our code? For example, in my code, after transferring the data (data = data.to(device, non_blocking=True)), I call the forward method of the model. In this case, is there any difference between non_blocking=True and non_blocking=False, since the forward pass has to wait for the data transfer to finish anyway?


If the next operation depends on your data, you won’t notice any speed advantage.
However, if an asynchronous data transfer is possible, you might be able to hide the transfer time behind another operation.
Did you encounter any strange issues using non_blocking=True?


Nah, everything seems fine. Could you give a quick example of the (common) cases in which we should use non_blocking?

Well, if you don’t have any synchronization points in your training loop (e.g. pushing the model output to the CPU) and use pin_memory=True in your DataLoader, the data transfer should be overlapped by the kernel execution:

for data, target in loader:
    # Overlapping transfer if pinned memory
    data = data.to('cuda:0', non_blocking=True)
    target = target.to('cuda:0', non_blocking=True)

    # The following operations are queued asynchronously: each kernel
    # is launched and control returns to the CPU thread before the
    # kernel has actually begun executing.
    output = model(data)  # has to wait for data to arrive on the device (sync point)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()

Here is some more in-depth information from the NVIDIA devblog.
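
If you want to verify the overlap yourself, one option (not from the original post, just a sketch with a made-up model and dataset) is to record a trace with torch.profiler and check whether the HtoD memcpy events overlap the compute kernels of the previous iteration:

import torch
from torch.profiler import profile, ProfilerActivity

# Toy stand-ins for the loop above (names and shapes are illustrative)
model = torch.nn.Linear(1024, 10).to('cuda:0')
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
dataset = torch.utils.data.TensorDataset(
    torch.randn(4096, 1024), torch.randint(0, 10, (4096,)))
loader = torch.utils.data.DataLoader(dataset, batch_size=256, pin_memory=True)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for data, target in loader:
        data = data.to('cuda:0', non_blocking=True)
        target = target.to('cuda:0', non_blocking=True)
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

prof.export_chrome_trace("trace.json")  # look for HtoD memcpys overlapping kernels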


Thanks for the example! Would that also be a good idea if I am using data parallel? My intuition is that, at the very least, it shouldn’t hurt?

I would just try it and compare the wall time.
If there are any synchronization points, you should still end up with the same time as with non_blocking=False in the worst case.
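
A minimal sketch of such a wall-time comparison (the model and shapes here are made up for illustration):

import time
import torch

model = torch.nn.Linear(1024, 10).to('cuda:0')
data_cpu = torch.randn(256, 1024, pin_memory=True)

def run(non_blocking):
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(100):
        data = data_cpu.to('cuda:0', non_blocking=non_blocking)
        output = model(data)
    torch.cuda.synchronize()  # wait for all queued work before stopping the clock
    return time.perf_counter() - start

run(True)  # warmup
print('non_blocking=False:', run(False))
print('non_blocking=True: ', run(True))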


It looks like there is no real disadvantage to using non_blocking=True.
Why not make it the default?


Do you know what the expected behaviour is if we set non_blocking=True and pin_memory=False?

Is this dangerous or just a harmless no-op?

Thanks 🙂

It should be harmless and I’m not aware of any side effects, but please let us know if you see something weird. 🙂
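
As a quick sanity check (illustrative, not from the thread), you can confirm at runtime whether the DataLoader actually hands you pinned batches via Tensor.is_pinned(); with pin_memory=False the source is pageable, so the copy cannot be truly asynchronous anyway:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(64, 8))

for pin in (False, True):
    loader = DataLoader(dataset, batch_size=8, pin_memory=pin)
    batch = next(iter(loader))[0]
    # Only with pin_memory=True does non_blocking=True give an async copy
    print('pin_memory={}: batch.is_pinned()={}'.format(pin, batch.is_pinned()))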


Thanks. Are you able to point me to the source for this method? I couldn’t find it, and I’d like to check what it does if pin_memory == False. I’ve been having some issues with DataLoaders hanging when num_workers > 0, and I’m wondering if this is the cause.


In this code, you mention that output = model(data) is a sync point. Does that mean this code will not be executed asynchronously?

Hi ptrblck,

I have the same concern as this post: https://stackoverflow.com/questions/63460538/proper-usage-of-pytorchs-non-blocking-true-for-data-prefetching

output = model(data) is not a synchronization point in itself, but it has to wait for the data to be transferred to the device. Sorry if the explanation was confusing.
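
A small sketch of this stream ordering (shapes are made up): the matmul below is queued on the same CUDA stream as the copy, so it cannot start before the copy has finished, even though the Python call returns immediately:

import torch

x_cpu = torch.randn(64, 1024, pin_memory=True)
x_gpu = x_cpu.to('cuda:0', non_blocking=True)  # async copy, queued on the current stream
y = x_gpu @ x_gpu.t()  # same stream, so it runs only after the copy has finished
torch.cuda.synchronize()  # block the CPU thread until all queued GPU work is done
print(y.shape)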


@brynhayder Pinned memory is a finite resource and allocating excessive amounts of pinned memory will slow down your system. This is especially true for 3D data or very large batch sizes.

If we set non_blocking=True and pin_memory=False, I think it could be dangerous, because PyTorch relies on the CachingHostAllocator to make sure that pinned memory is not freed while a kernel launched asynchronously in the CUDA stream might still be using it.

Could you point me to the line of code to check this behavior, please?

I have found non_blocking=True to be very dangerous when going from GPU->CPU. For example:

import torch

# Note: pin_memory only applies to CPU tensors, so it has no effect here
action_gpu = torch.tensor([1.0], device=torch.device('cuda'), pin_memory=True)
print(action_gpu)
# Returns immediately; the device-to-CPU copy may still be in flight
action_cpu = action_gpu.to(torch.device('cpu'), non_blocking=True)
print(action_cpu)

Output:

tensor([1.], device='cuda:0')
tensor([0.])

Process finished with exit code 0

Any idea why the tensors are not equal? I would expect the thread to block until the transfer from the GPU is finished.
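
A common workaround for this race (a minimal sketch for reference, not from the original post) is to synchronize before reading the CPU tensor, so the device-to-host copy is guaranteed to have finished:

import torch

action_gpu = torch.tensor([1.0], device='cuda')
action_cpu = action_gpu.to('cpu', non_blocking=True)  # async D2H copy
torch.cuda.synchronize()  # wait for the copy before reading the CPU tensor
print(action_cpu)  # now prints tensor([1.])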