I’ve read the documentation saying that if we have pinned memory, we can set non_blocking to True. Could this cause any problems in our code? For example, in my code, after transferring the data (data = data.to(device, non_blocking=True)), I call the forward method of the model. In this case, is there any difference between non_blocking being True or not, since forward has to wait for the data transfer to finish anyway?
If the next operation depends on your data, you won’t notice any speed advantage. However, if an asynchronous data transfer is possible, you might hide the transfer time behind another operation.
Did you encounter any strange issues using non_blocking=True?
Nah. Everything seems fine. Could you give a quick example of (common) cases in which we should use non_blocking?
Well, if you don’t have any synchronization points in your training loop (e.g. pushing the model output to the CPU) and use pin_memory=True in your DataLoader, the data transfer should be overlapped by the kernel execution:
for data, target in loader:
    # Overlapping transfer if pinned memory
    data = data.to('cuda:0', non_blocking=True)
    target = target.to('cuda:0', non_blocking=True)

    # The following code will be called asynchronously,
    # such that the kernel will be launched and returns control
    # to the CPU thread before the kernel has actually begun executing
    output = model(data)  # has to wait for data to be pushed onto device (synch point)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
Here is some more in-depth information from the NVIDIA devblog.
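As a side note, a typical accidental synchronization point is reading a scalar back to the CPU in every iteration, e.g. via .item() or by printing the loss, since that forces the CPU thread to wait for the GPU. A minimal illustration:

import torch

loss = torch.randn(1, device='cuda')  # stand-in for a loss computed on the GPU
val = loss.item()                     # copies the scalar to the CPU and synchronizes with the GPU
print(val)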
Thanks for the example! Would that be a good idea if I am using data parallel? In my mind, at least it should not hurt?
I would just try it and compare the wall time. If there are any synchronization points, you should still end up with the same time as with non_blocking=False in the worst case.
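A rough sketch of such a comparison (loader and model are placeholders for your own objects):

import time
import torch

def time_epoch(loader, model, non_blocking):
    torch.cuda.synchronize()
    start = time.time()
    for data, target in loader:
        data = data.to('cuda:0', non_blocking=non_blocking)
        target = target.to('cuda:0', non_blocking=non_blocking)
        output = model(data)
    torch.cuda.synchronize()  # wait for all queued GPU work before stopping the clock
    return time.time() - start

# compare e.g.:
# print(time_epoch(loader, model, non_blocking=False))
# print(time_epoch(loader, model, non_blocking=True))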
It looks like there is no real disadvantage to using non_blocking=True. Why not make it the default?
Do you know what the expected behaviour is if we set non_blocking=True and pin_memory=False? Is this dangerous or just a harmless no-op? Thanks
It should be harmless and I’m not aware of any side effects, but please let us know if you see something weird.
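As far as I understand, without pinned memory the host-to-device copy cannot be truly asynchronous, so non_blocking=True should effectively behave like a normal blocking copy. A quick way to check whether your batches are actually pinned (toy dataset, names made up):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(128, 3, 32, 32))
loader = DataLoader(dataset, batch_size=32, pin_memory=False)  # flip to True to compare

for (data,) in loader:
    print(data.is_pinned())                    # False here, True with pin_memory=True
    data = data.to('cuda', non_blocking=True)  # without pinned memory this should act like a blocking copy
    break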
Thanks. Are you able to point me to the source for this method? I couldn’t find it and I’d like to check what it does if pin_memory==False. I’ve been having some issues with dataloaders hanging when num_workers > 0 and I’m wondering if it’s this.
In this code, you mention that output = model(data) is a synch point, which means that this code will not be executed asynchronously?
Hi ptrblck, I have the same concern as in this post: https://stackoverflow.com/questions/63460538/proper-usage-of-pytorchs-non-blocking-true-for-data-prefetching
output = model(data) is not synchronizing in itself, but it has to wait for the data to be transferred to the device. Sorry if the explanation was confusing.
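As I understand it, the copy and the following kernels are enqueued on the same (default) CUDA stream, so they execute in issue order even though the Python thread does not wait. A minimal sketch:

import torch

x_cpu = torch.randn(1024, 1024, pin_memory=True)
x_gpu = x_cpu.to('cuda', non_blocking=True)  # async copy enqueued on the current stream
y = x_gpu @ x_gpu  # enqueued after the copy on the same stream, so it sees the transferred data
torch.cuda.synchronize()  # only the CPU thread needs an explicit wait
print(y.sum())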
@brynhayder Pinned memory is a finite resource and allocating excessive amounts of pinned memory will slow down your system. This is especially true for 3D data or very large batch sizes.
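For reference, you can also pin a tensor explicitly, which makes the page-locked allocation easy to see; each such copy counts against physical host RAM:

import torch

x = torch.randn(64, 3, 512, 512)  # regular pageable host memory
x_pinned = x.pin_memory()         # returns a page-locked copy of the tensor
print(x_pinned.is_pinned())       # True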
If we set non_blocking=True and pin_memory=False, I think it could be dangerous, because PyTorch relies on the CachingHostAllocator to make sure that pinned memory is not freed before the kernels launched asynchronously on the CUDA stream have finished using it.
Could you point me to the line of code to check this behavior, please?
I have found non_blocking=True to be very dangerous when going from GPU->CPU. For example:
import torch
action_gpu = torch.tensor([1.0], device=torch.device('cuda'), pin_memory=True)
print(action_gpu)
action_cpu = action_gpu.to(torch.device('cpu'), non_blocking=True)
print(action_cpu)
output
tensor([1.], device='cuda:0')
tensor([0.])
Process finished with exit code 0
Any idea why the tensors are not equal? I would expect the thread to block until the transfer from the GPU is finished.
I’m facing a similar issue.
It looks like setting non_blocking=True when going from GPU to CPU does not make much sense if you intend to use the data right away, because it is not safe.
In the other direction, the CUDA kernel will wait for the transfer to end before starting to compute on the GPU.
But when going from GPU to CPU, it is the CPU that will compute, and it does not seem to be aware of the transfer. The tensor is created on the CPU, probably with zero values, while the transfer has not finished yet. For the CPU, the tensor is already there, so it starts computing… with the wrong values. The CPU will know that the transfer is done only when it explicitly asks CUDA, using torch.cuda.synchronize() for instance.
@ptrblck any insights on how to make the GPU-to-CPU transfer safe while keeping it fast, i.e. with non_blocking=True? Thanks.
Reading other posts, it seems that copying from GPU to CPU with non_blocking=True could be a huge risk unless you plan to use the tensors long after the call to the transfer, which is then expected to have finished by the time you access the data. The same applies when going from CPU to GPU, except that in that case it is CUDA that will block the GPU from using the data if it is not ready yet, as mentioned somewhere in this thread. An asynchronous transfer is like a background thread: if you access the result before the thread has finished its job, you may use the wrong data. This aspect does not seem to be controlled on the CPU side…
Example:
import time
import torch
# ....
# x: cuda tensor
min_x = x.min()
max_x = x.max()
t = (min_x - max_x).to(torch.device("cpu"), non_blocking=True)
print(t)
time.sleep(2.)
print(t)
output:
tensor(0.)
tensor(-254.) # the right value: min_x = 0, max_x= 254, t = 0 - 254 = -254.
So, no GPU-to-CPU transfer with non_blocking=True unless you intend to use the transferred data much later. And even then, you won’t be sure whether the transfer has finished or not.
Note that Python’s print also creates a synchronization point when it has to move a tensor to the CPU before accessing its content. But because the lazy transfer has already created the tensor on the CPU, print just reads its (wrong) content.
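The only workaround that seems to work for me is to synchronize, or wait on a recorded CUDA event, before touching the CPU tensor:

import torch

x = torch.rand(32, 256, 220, 220, device='cuda')
t = (x.min() - x.max()).to('cpu', non_blocking=True)  # async copy is only enqueued here

done = torch.cuda.Event()
done.record()       # recorded on the same stream, right after the copy
done.synchronize()  # or torch.cuda.synchronize(): block until the copy has really finished
print(t)            # now safe to read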
@sbelharbi I have tried the same code as given in the example and cannot reproduce it. Could you please tell me which environment you are using?
Sorry, I’m not sure if I mentioned that, but the code I provided in the example is from my own code, where x is the result of a forward pass through a large network, so a simple snippet won’t reproduce it. The forward pass needs to be long enough that the CUDA kernels are launched but not finished while the CPU has already moved on to the next instructions, such as print(t).
Here is a full dummy example. In this example, because I synchronized after computing z, it is not the forward pass that is slow; it is the min/max op over a large tensor.
import time
import torch


class Module(torch.nn.Module):
    def forward(self) -> torch.Tensor:
        x = torch.rand(32, 256, 220, 220).cuda()
        s = torch.rand(32, 256, 220, 220).cuda()
        conv = torch.nn.Conv2d(256, 256, 3).cuda()
        z1 = torch.pow(x, 2)
        z2 = z1 / 1000.
        z3 = conv(z2) + conv(s)
        z3 = conv(torch.pow(z3, 2) + z3 * 2 / 100.)
        z4 = z3 / 100.
        return z4


if __name__ == '__main__':
    inst = Module().cuda()
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    z = inst()
    end_event.record()
    torch.cuda.synchronize()
    elapsed_time_ms = start_event.elapsed_time(end_event)
    print('time : {}'.format(elapsed_time_ms))
    t = (z.min() - z.max()).to(torch.device("cpu"), non_blocking=True)
    print(t)
    time.sleep(2.)
    print(t)
output
time : 7206.33154296875
tensor(0., grad_fn=<CopyBackwards>)
tensor(-0.0099, grad_fn=<CopyBackwards>)
Here is a simple snippet with large tensors:
import time
import torch

if __name__ == '__main__':
    seed = 0
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    x = torch.rand(32, 256, 220, 220).cuda()
    t = (x.min() - x.max()).to(torch.device("cpu"), non_blocking=True)
    print(t)
    time.sleep(2.)
    print(t)
output:
tensor(0.)
tensor(-1.0000)
I used https://raw.githubusercontent.com/pytorch/pytorch/master/torch/utils/collect_env.py to collect the info. Let me know if you need more. Thanks.
$ python collect_env.py
Collecting environment information...
PyTorch version: 1.9.0
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.10
Python version: 3.7.9 (default, Aug 31 2020, 12:42:55) [GCC 7.3.0] (64-bit runtime)
Python platform: Linux-4.15.0-122-generic-x86_64-with-debian-buster-sid
Is CUDA available: True
CUDA runtime version: 10.0.130
GPU models and configuration:
GPU 0: Tesla P100-PCIE-16GB
GPU 1: Tesla P100-PCIE-16GB
Nvidia driver version: 455.32.00
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.4.2
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] efficientnet-pytorch==0.7.0
[pip3] numpy==1.20.1
[pip3] torch==1.9.0
[pip3] torchvision==0.10.0
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.1.74 h6bb024c_0 nvidia
[conda] efficientnet-pytorch 0.7.0 pypi_0 pypi
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] mkl 2021.3.0 h06a4308_520
[conda] numpy 1.20.1 pypi_0 pypi
[conda] pytorch 1.9.0 py3.7_cuda11.1_cudnn8.0.5_0 pytorch
[conda] torchvision 0.10.0 py37_cu111 pytorch