A simple print function could hugely can reduce the significant time delay

I recently found a very weird thing, a simple print could cause significant speedup, which is unexpected.

first you should install the package via:
pip install git+https://github.com/Karbo123/pytorch_hash.git

then you can try the following codes to reproduce:

import torch
from pytorch_hash import pytorch_hash

torch.manual_seed(0)
xyz = torch.randn([10, 3], device="cuda")
depth = torch.randn([10], device="cuda")
uv = xyz[:, :2] / depth.unsqueeze(1)

# print(depth) # magic print function
ind_slot = pytorch_hash(uv.contiguous())
print(ind_slot)
print(depth) # without the magic print, it takes very long time
print("all done!")

you may observe print(depth) takes very long time to execute, but if you add the print(depth) before pytorch_hash executes, it prints very fast. I really wonder what actually happen inside the print function?

Could you explain what “it prints very fast” means here?
print statements are synchronizing the device with the host. If you want to profile specific methods use the torch.utils.benchmark util. or synchronize the code via torch.cuda.synchronize().

Hi, I really appreciate for the quick reply. The real problem here is that, the print() statement causes significant time delay compared to usual print(). We all know generally, printing a tensor should takes very short time (e.g. no more than 1sec).

If you run the codes I provided, you may reproduce the problem, you will find that the simple print(depth) statement will take very long time (e.g. more than 30sec) to print out to the terminal, but actually the tensor depth is quite small (i.e. shape=(10,)), and the tensor size shouldn’t account for the problem, because I think copying the cuda memory to the host memory should be super fast as the memory size is very small.

However, very surprisedly, if you additionally add the print(depth) statement before ind_slot = pytorch_hash(uv.contiguous()), you will find the time delay of the two print(depth) statements will no longer be a problem, they run very quickly as usual (e.g. no more than 1sec).

Also, very interestingly, you can replace the print(depth) to just print one element print(depth[-1]), and it also takes the same effect. But if print(depth) becomes depth[-1], depth, depth.cpu() or print(depth.cpu()), it just cannot reduce the time delay.

I really want to know why the simple print() can hugely reduce the unexpected time delay?

By the way, I use torch.__version__==1.8.0+cu111, and NVCC==V11.1.74.

That’s not entirely true, as it depends where the data comes from.
CUDA operations are executed asynchronously so if you run a code snippet such as:

out = torch.matmul(x, x)

the actual matmul kernel will be executed in the background while the CPU can execute the next line of code.
However, since a print statement needs to see the data, transfer it to the host (which will thus synchronize the code), the actual runtime depends on the GPU execution time.
I.e. the print statement in:

out = torch.malmul(x, x)
print(out)

has to wait for the GPU to finish its execution (assuming x is on the GPU), move out to the host, and finally print it.

Because you are implicitly synchronizing the code and thus get different timings.
Use either the bottleneck util. to profile code or synchronize explicitly via torch.cuda.synchronize() to properly profile the code and get valid times.

Hi, thanks for the detailed reply! I do understand the cuda’s implicit synchronization mechanism. And to check whether explicit synchronization could solve the problem, I try the following codes (only show the core part).

# print(depth) # magic print function
torch.cuda.synchronize()
ind_slot = pytorch_hash(uv.contiguous())
torch.cuda.synchronize()
print(ind_slot)
torch.cuda.synchronize()
print(depth) # without the magic print, it takes very long time
torch.cuda.synchronize()
print("all done!")

You can still to see that, after print(ind_slot) is finished (print to the terminial), it takes very very long time to see the next print result (i.e. print(depth)). By the way, I’ve also tried to run the code with CUDA_LAUNCH_BLOCKING=1 python test.py, but it doesn’t help.

One noteworthy thing is that, the variable depth needn’t to be computed, we just declare it using depth = torch.randn([10], device="cuda").

But if you additionally add print(depth) to the beginning, the second print(depth) runs very fast as expected. This is very strange.

CUDA is not implicitly synchronizing for you, but the print statement will add a sync.

I cannot reproduce any change in the reported time by using or removing the “magic” print statement and get ~3s (2.9-3.1s) on multiple runs.
Maybe creating a profile (e.g. via Nsight Systems) might help you see the execution and check the kernel times.

Thanks, update the pytorch to 1.10.2+cu111 solves the problem.