You are using a Python if condition that checks whether any value of a CUDA tensor meets a specific condition, so yes: this line will synchronize. Otherwise your Python program wouldn't know which branch to take. Python itself runs on the host and therefore needs to read the value from the GPU.
You can also verify it manually:
import torch

x = torch.randn(10, 10, device="cuda")
# Emit a warning whenever an operation forces a host/device synchronization
torch.cuda.set_sync_debug_mode("warn")
if x.any():
    y = x.sum()
# UserWarning: called a synchronizing CUDA operation (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:147.)
#   if x.any():
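To make the implicit host read visible, here is a minimal sketch (not taken from the original code) that spells out the conversion the if performs and shows one possible way to keep the decision on the device. The fallback to CPU and the zero placeholder value are assumptions made only so the snippet runs anywhere.

```python
import torch

# Assumption: fall back to CPU so the sketch also runs without a GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(10, 10, device=device)

flag_tensor = x.any()      # 0-dim bool tensor that still lives on `device`
flag = bool(flag_tensor)   # what `if x.any():` does implicitly: copy the value to the host

# Branch-free alternative: the condition is evaluated on the device,
# so Python never has to read it. Note that x.sum() is always computed.
y = torch.where(flag_tensor, x.sum(), torch.zeros((), device=device))
```

The torch.where variant only helps when both outcomes can be expressed as tensors; any real Python branch on a GPU value still needs the host read.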
ok.
As you can see in the top snippet, 1000 calls to tensor.tolist() on a tensor of shape (1000, 4) take 197ms in total.
In my application the profiler reports:
------------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
Name                   Self CPU %       Self CPU    CPU total %      CPU total   CPU time avg      Self CUDA    Self CUDA %     CUDA total  CUDA time avg        CPU Mem   Self CPU Mem       CUDA Mem  Self CUDA Mem     # of Calls
------------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------  -------------
aten::to                    1.39%       68.932ms         24.81%         1.226s      147.217us        0.000us          0.00%       33.922ms        4.073us        1.53 Mb       86.49 Kb      193.15 Mb       19.86 Mb           8329
aten::_to_copy              0.57%       28.289ms         24.71%         1.221s      160.289us        0.000us          0.00%       35.253ms        4.628us        1.60 Mb      -26.12 Kb      193.15 Mb            0 b           7618
aten::copy_                 0.99%       48.753ms         23.01%         1.137s       89.709us       23.280ms          1.16%       54.509ms        4.299us      214.80 Kb      214.80 Kb            0 b            0 b          12679
cudaMemcpyAsync            20.57%         1.017s         20.57%         1.017s       94.364us       17.064ms          0.85%       17.064ms        1.584us            0 b            0 b            0 b            0 b          10775
is_finished_tolist         -0.82%   -40638.000us         19.40%      959.056ms        3.165ms        0.000us          0.00%      300.000us        0.990us            0 b     -374.33 Kb            0 b            0 b            303
That is 303 calls on tensors of shape at most (960, 4), and usually with far fewer than 960 rows.
The code that calls is_finished_tolist is:
self.topk_ids = self.topk_ids.eq(self.eos)
with record_function("is_finished_tolist"):
    self.is_finished_list = self.topk_ids.tolist()
So I am trying to record ONLY the tolist() operation, and it takes 959ms in total.
Is CUDA somehow waiting for something else to finish?
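One way to check this is to drain the queue explicitly before the measured region. CUDA kernels are launched asynchronously, so a tolist() call has to wait for all previously queued work on the stream before it can copy the data. The sketch below is a hypothetical helper (timed_tolist, the eos value and the random input are assumptions, not the application's code) that separates the waiting time from the copy itself.

```python
import time
import torch

def timed_tolist(t):
    """Return (wait_s, copy_s, values) for a tensor `t`.

    wait_s: time spent waiting for previously queued GPU work
    copy_s: time spent in tolist() once the device is idle
    """
    start = time.perf_counter()
    if t.is_cuda:
        torch.cuda.synchronize()  # block until all queued kernels have finished
    waited = time.perf_counter()
    values = t.tolist()           # device-to-host copy plus Python list construction
    done = time.perf_counter()
    return waited - start, done - waited, values

# Assumption: fall back to CPU so the sketch also runs without a GPU
device = "cuda" if torch.cuda.is_available() else "cpu"
topk_ids = torch.randint(0, 5, (960, 4), device=device)
eos = 2  # hypothetical end-of-sequence id

finished = topk_ids.eq(eos)  # queued asynchronously on a GPU
wait_s, copy_s, is_finished_list = timed_tolist(finished)
print(f"waiting: {wait_s * 1e3:.3f} ms, tolist: {copy_s * 1e3:.3f} ms")
```

If most of the time lands in the waiting part, the profiler entry is mostly attributing the cost of earlier kernels to tolist() rather than measuring the copy.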