Should we set non_blocking to True?

I would just try it and compare the wall time.
If there are any synchronization points, you should still end up with the same time as with non_blocking=False in the worst case.
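
A rough way to run that comparison is sketched below. The helper name `time_transfer`, the batch shape, and the iteration count are made up for illustration, and the sketch falls back to the CPU (where the copy is a no-op) if no GPU is available. It only times the copy itself, so treat the numbers as a starting point and measure your real training loop as well.

```python
import time

import torch


def time_transfer(non_blocking: bool, n_iters: int = 20) -> float:
    """Return the average wall time in seconds of one host-to-device copy."""
    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")
    batch = torch.randn(16, 3, 224, 224)
    if use_cuda:
        # Asynchronous host-to-device copies need page-locked (pinned) memory.
        batch = batch.pin_memory()
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iters):
        batch.to(device, non_blocking=non_blocking)
    if use_cuda:
        # Wait for all queued copies before stopping the clock.
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_iters


if __name__ == "__main__":
    for flag in (False, True):
        print(f"non_blocking={flag}: {time_transfer(flag) * 1e3:.3f} ms per copy")
```

The explicit `torch.cuda.synchronize()` calls matter here: without them the timer could stop while copies are still in flight.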


It looks like there is no real disadvantage to using “non_blocking=True”.
Why not make it the default?


Do you know what the expected behaviour is if we set non_blocking=True and pin_memory=False?

Is this dangerous or just a harmless no-op?

Thanks :slight_smile:

It should be harmless and I’m not aware of any side effects, but please let us know, if you see something weird. :slight_smile:
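
If you want to check this on your own setup, here is a small sketch (the function name `copy_from_pageable` is just for illustration). It copies a tensor from ordinary, unpinned host memory with non_blocking=True and verifies that the data arrives intact. It uses the GPU when one is available and otherwise runs on the CPU, where the check passes trivially.

```python
import torch


def copy_from_pageable(non_blocking: bool = True) -> bool:
    """Copy an unpinned CPU tensor to the device and check the round trip."""
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # A freshly created CPU tensor lives in ordinary pageable memory.
    src = torch.arange(1_000_000, dtype=torch.float32)
    dst = src.to(device, non_blocking=non_blocking)
    # .cpu() is a blocking copy, so the comparison sees the final data.
    return bool(torch.equal(dst.cpu(), src))


if __name__ == "__main__":
    print("data intact:", copy_from_pageable(True))
```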


Thanks. Are you able to point me to the source for this method? I couldn’t find it and I’d like to check what it does if pin_memory==False. I’ve been having some issues with dataloaders hanging when num_workers > 0 and I’m wondering if it’s this.


In this code, you mention that output = model(data) is a sync point. Does that mean this code will not be executed asynchronously?

Hi, ptrblck

I have the same concern about this.

output = model(data) is not synchronizing in itself, but would have to wait for the data to be transferred to the device. Sorry, if the explanation was confusing.
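
To make the ordering concrete, here is a minimal sketch of the usual pattern. The tiny model, the random dataset, and the name `run_epoch` are placeholders, not anything from the original code. The copies are queued with non_blocking=True, and the forward pass is queued behind them, so it uses the transferred data without an explicit synchronization call.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def run_epoch() -> float:
    """Train a toy model for one epoch and return the last batch loss."""
    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")
    dataset = TensorDataset(torch.randn(256, 10), torch.randint(0, 2, (256,)))
    # pin_memory=True lets the host-to-device copies run asynchronously.
    loader = DataLoader(dataset, batch_size=32, pin_memory=use_cuda)
    model = nn.Linear(10, 2).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    last_loss = 0.0
    for data, target in loader:
        # These calls return immediately; the copies are queued on the stream.
        data = data.to(device, non_blocking=True)
        target = target.to(device, non_blocking=True)
        # The forward kernels are queued after the copies, so they see the data.
        output = model(data)
        loss = criterion(output, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # .item() blocks the host until the loss value is available.
        last_loss = loss.item()
    return last_loss


if __name__ == "__main__":
    print("last loss:", run_epoch())
```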


@brynhayder Pinned memory is a finite resource and allocating excessive amounts of pinned memory will slow down your system. This is especially true for 3D data or very large batch sizes.


If we set non_blocking=True and pin_memory=False, I think it could be dangerous, because PyTorch has a CachingHostAllocator that makes sure pinned memory is not freed while asynchronously launched work on the CUDA stream may still be using it.

Could you point me to the line of code to check this behavior, please?

I have found non_blocking=True to be very dangerous when going from GPU->CPU. For example:

import torch
action_gpu = torch.tensor([1.0], device=torch.device('cuda'), pin_memory=True)
action_cpu = action_gpu.to(torch.device('cpu'), non_blocking=True)


tensor([1.], device='cuda:0')

Process finished with exit code 0

Any idea why the tensors are not equal? I would expect the thread to block until the transfer from the GPU is finished.


I'm facing a similar issue.

It looks like setting non_blocking=True when going from GPU to CPU does not make much sense if you intend to use the data right away, because it is not safe.
In the other direction, the CUDA kernel waits for the transfer to finish before it starts computing on the GPU.
But when going from GPU to CPU, it is the CPU that computes, and it does not seem to be aware of the transfer. The tensor is created on the CPU, probably with zero values, while the transfer has not finished yet. For the CPU, the tensor is already there, so it starts computing with the wrong values. The CPU only knows that the transfer is done when it explicitly asks CUDA, for instance with torch.cuda.synchronize().

@ptrblck any insights on how to make GPU-to-CPU transfers safe while staying fast, i.e. with non_blocking=True? Thanks.

After reading other posts, it seems that copying from GPU to CPU with non_blocking=True is a big risk unless you plan to use the tensors long after the call that starts the transfer, so that it has finished by the time you access the data. The same applies to CPU-to-GPU copies, but in that case CUDA keeps the GPU from using the data until it is ready, as mentioned elsewhere in this thread. An asynchronous transfer is like a background thread: if you access its result before the thread has finished its job, you may read the wrong data. This does not seem to be controlled on the CPU side.


        import time
        # ....
        # x: cuda tensor
        min_x = x.min()
        max_x = x.max()

        t = (min_x - max_x).to(torch.device("cpu"), non_blocking=True)


tensor(-254.)  # the right value: min_x = 0, max_x= 254, t = 0 - 254 = -254.

So: no GPU-to-CPU transfer with non_blocking=True unless you intend to use the transferred data much later. And even then, you won't be sure whether the transfer has finished.

Note that Python's print also creates a synchronization point, moving the tensor to the CPU before accessing its content. But because the lazy transfer has already created the tensor on the CPU, print just reads its (wrong) content.
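
One possible way to keep the GPU-to-CPU copy non-blocking and still read the right value is to record a CUDA event after the copy and wait on that event before touching the CPU tensor. The sketch below assumes this approach; the function name `min_minus_max_to_cpu` is made up, and other options (such as calling torch.cuda.synchronize() before the read) should work as well.

```python
import torch


def min_minus_max_to_cpu(x: torch.Tensor) -> torch.Tensor:
    """Compute x.min() - x.max() and return it as a CPU tensor that is safe to read."""
    result = x.min() - x.max()
    if not x.is_cuda:
        return result  # already on the CPU, nothing to transfer
    # A pinned destination buffer lets the device-to-host copy run asynchronously.
    host = torch.empty(result.shape, dtype=result.dtype, pin_memory=True)
    host.copy_(result, non_blocking=True)
    done = torch.cuda.Event()
    done.record()  # marks the point right after the copy on the current stream
    # ... unrelated CPU work could overlap with the copy here ...
    done.synchronize()  # block the host until the copy has really finished
    return host


if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.tensor([0.0, 17.0, 254.0], device=device)
    print(min_minus_max_to_cpu(x))
```

Waiting on an event only blocks until that specific copy is done, which is less drastic than synchronizing the whole device.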


@sbelharbi I have tried the same code as given in the example and cannot reproduce it. Could you please tell me which environment you are using?

Sorry, I'm not sure I mentioned that.
The code I provided in the example is from my own code, where x is the result of a forward pass through a large network, so a simple snippet won't reproduce it.
The forward pass needs to be long enough that the CUDA kernels are launched but not finished while the CPU has moved on to the next instructions, such as print(t).

Here is a full dummy example. In this example, because I synchronized after computing z, it is not the forward pass that is slow; it is the min/max op over a large tensor.

import time

import torch

class Module(torch.nn.Module):
    def forward(self) -> torch.Tensor:
        x = torch.rand(32, 256, 220, 220).cuda()
        s = torch.rand(32, 256, 220, 220).cuda()
        conv = torch.nn.Conv2d(256, 256, 3).cuda()
        z1 = torch.pow(x, 2)
        z2 = z1 / 1000.
        z3 = conv(z2) + conv(s)
        z3 = conv(torch.pow(z3, 2) + z3 * 2 / 100.)
        z4 = z3 / 100.

        return z4

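if __name__ == '__main__':
    inst = Module().cuda()
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    z = inst()
    end_event.record()
    # Both events must have completed before elapsed_time() can be called.
    torch.cuda.synchronize()
    elapsed_time_ms = start_event.elapsed_time(end_event)
    print('time : {}'.format(elapsed_time_ms))

    t = (z.min() - z.max()).to(torch.device("cpu"), non_blocking=True)
    print(t)  # may run before the copy has finished
    time.sleep(2.)  # arbitrary pause to give the copy time to finish
    print(t)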


time : 7206.33154296875
tensor(0., grad_fn=<CopyBackwards>)
tensor(-0.0099, grad_fn=<CopyBackwards>)

here is a simple snippet with large tensors:

import time

import torch

if __name__ == '__main__':
    seed = 0

    x = torch.rand(32, 256, 220, 220).cuda()

    t = (x.min() - x.max()).to(torch.device("cpu"), non_blocking=True)



I collected the environment info below. Let me know if you need more. Thanks.

$ python
Collecting environment information...
PyTorch version: 1.9.0
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.10

Python version: 3.7.9 (default, Aug 31 2020, 12:42:55)  [GCC 7.3.0] (64-bit runtime)
Python platform: Linux-4.15.0-122-generic-x86_64-with-debian-buster-sid
Is CUDA available: True
CUDA runtime version: 10.0.130
GPU models and configuration: 
GPU 0: Tesla P100-PCIE-16GB
GPU 1: Tesla P100-PCIE-16GB

Nvidia driver version: 455.32.00
cuDNN version: /usr/lib/x86_64-linux-gnu/
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] efficientnet-pytorch==0.7.0
[pip3] numpy==1.20.1
[pip3] torch==1.9.0
[pip3] torchvision==0.10.0
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               11.1.74              h6bb024c_0    nvidia
[conda] efficientnet-pytorch      0.7.0                    pypi_0    pypi
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] mkl                       2021.3.0           h06a4308_520  
[conda] numpy                     1.20.1                   pypi_0    pypi
[conda] pytorch                   1.9.0           py3.7_cuda11.1_cudnn8.0.5_0    pytorch
[conda] torchvision               0.10.0               py37_cu111    pytorch

Sorry for picking up an old thread, but I found your statement interesting.

Without any in-depth knowledge of how GPUs work, I assume this means: the data transfer to the GPU can happen independently of the computation, i.e. tensor transformations, which is why non_blocking=True is a good option.

If, however, we wanted to do something that changes the data itself, say normalize it along some dimension, then it's not really going to help, because the updated data has to be ready before the output = model(data) part.

Is this understanding of mine largely correct?

If I understand your description correctly, your general understanding should be correct.
Asynchronous operation would allow you to execute other operations in the meantime while the async operation is being executed in the background. If you have a data dependency between both tasks, the execution of the data-dependent operation would need to wait.
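
As a small illustration of such a data dependency, the sketch below queues a non-blocking copy and then normalizes the batch on the device. The function name `transfer_then_normalize` and the epsilon are illustrative choices. The normalization is queued on the same stream as the copy, so it waits for the data and no manual synchronization is needed for correctness.

```python
import torch


def transfer_then_normalize(batch: torch.Tensor) -> torch.Tensor:
    """Copy a CPU batch to the device asynchronously, then normalize each column."""
    use_cuda = torch.cuda.is_available()
    device = torch.device("cuda" if use_cuda else "cpu")
    if use_cuda:
        batch = batch.pin_memory()  # needed for a truly asynchronous copy
    on_device = batch.to(device, non_blocking=True)
    # These ops depend on the copy, so they run only after it has completed.
    mean = on_device.mean(dim=0, keepdim=True)
    std = on_device.std(dim=0, keepdim=True)
    return (on_device - mean) / (std + 1e-6)


if __name__ == "__main__":
    out = transfer_then_normalize(torch.randn(128, 8))
    print(out.mean(dim=0))
```

The host returns from `.to(...)` right away and can prepare the next batch, while the dependent GPU work simply runs later in stream order.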


Hello! How do I use non_blocking=True in C++ with libtorch?

The same methods should also accept the non_blocking argument e.g. as seen in:

Module::to(at::Device device, at::ScalarType dtype, bool non_blocking)

Thanks, Patrick, for the example! I tried this and it seems to work:

  int height =400;
  int width = 400;
  std::vector<int64_t> dims = { 1, height, width, 3 };
  auto options = torch::TensorOptions().dtype(torch::kUInt8).device({ torch::kCUDA }).requires_grad(false);
  torch::Tensor tensor_image_style2 = torch::zeros(dims, options);
  bool non_blocking = true;
  bool copy_flag = false;
  tensor_image_style2 =, non_blocking, copy_flag);