Slow iteration over tensor elements, slow any()

I found that .tolist() was kind of slow, but then I was puzzled by how slow iterating over the tensor itself is.

Maybe I am doing something wrong. Here is a snippet:

import torch
from math import cos
import time

def test_with_list(topk_ids):
    true_value = 0
    false_value = 0
    topk_ids_list = topk_ids.tolist()
    for i in range(len(topk_ids_list)):
        if any(topk_ids_list[i]):
            x = cos(i)
            false_value += 1
        else:
            true_value += 1
    return true_value, false_value

def test_with_tensor(topk_ids):
    true_value = 0
    false_value = 0
    for i in range(topk_ids.size(0)):
        if topk_ids[i].any():
            x = cos(i)
            false_value += 1
        else:
            true_value += 1
    return true_value, false_value


def main():
    topk_ids = torch.randint(0, 2, (1000, 4), device=torch.device("cuda"))
    
    torch.cuda.synchronize()
    beg_time = time.time()
    for i in range(1000):
        true_value, false_value = test_with_list(topk_ids)
    torch.cuda.synchronize()
    print(time.time() - beg_time)
    
    beg_time = time.time()
    for i in range(1000):
        true_value, false_value = test_with_tensor(topk_ids)
    torch.cuda.synchronize()
    print(time.time() - beg_time)

    topk_ids = topk_ids.cpu()
    beg_time = time.time()
    for i in range(1000):
        true_value, false_value = test_with_tensor(topk_ids)
    torch.cuda.synchronize()
    print(time.time() - beg_time)

if __name__ == "__main__":
    main()

Results:

0.19742751121520996
13.819873332977295
3.40899920463562

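So iterating over the list and using any(list) is quite fast (about 0.2 s).
Iterating over the tensor and using tensor.any() is very slow when the tensor is on the GPU (about 13.8 s) and somewhat better when it is on the CPU (about 3.4 s).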

My original question was about .tolist() and whether there is a faster way.

Thanks.

You are synchronizing the code in every iteration by using data-dependent control flow on a CUDATensor:

if topk_ids[i].any():

The tensor itself is tiny, so your GPU should show low utilization, since you are just waiting and transferring data most of the time.

Sorry, I don't get it.

Is tensor.any() transferring data? Isn't it just a test on the values, done directly on the device where the tensor lives?

You are using a Python if condition checking if any value of a CUDATensor meets a specific condition, so yes: this line of code will synchronize the code. Otherwise your Python program wouldn’t know which path to take. Python itself is running on the host and thus needs to read the value from the GPU.

You can also verify it manually:

import torch

x = torch.randn(10, 10, device="cuda")

torch.cuda.set_sync_debug_mode("warn")

if x.any():
    y = x.sum()
#UserWarning: called a synchronizing CUDA operation (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:147.)
#  if x.any():
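
One way to avoid the per-row synchronization is to do the row-wise reduction on the device and bring the result to the host once. The following is only a minimal sketch, not the original poster's code: the function name count_rows is hypothetical, and it assumes the per-row work (the cos(i) placeholder in the snippet above) can run after a single transfer.

```python
import torch

def count_rows(topk_ids):
    """Count all-zero rows and rows with a nonzero entry using one host transfer."""
    # The row-wise reduction runs on whichever device holds the tensor.
    # tolist() is the single point where the host waits for the device.
    flags = topk_ids.any(dim=1).tolist()
    false_value = sum(flags)            # rows with at least one nonzero entry
    true_value = len(flags) - false_value
    return true_value, false_value

# Falls back to the CPU when no GPU is present (an assumption for portability).
device = "cuda" if torch.cuda.is_available() else "cpu"
topk_ids = torch.randint(0, 2, (1000, 4), device=device)
print(count_rows(topk_ids))
```

If Python-side work is still needed per row, it can loop over flags, which is a plain list of Python booleans and costs no further synchronization.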

OK.
As you can see in the top snippet, 1000 calls to tensor.tolist() (plus the Python loop over the list) take 197 ms for a tensor of shape (1000, 4).

In my application, the profiler shows:

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                               aten::to         1.39%      68.932ms        24.81%        1.226s     147.217us       0.000us         0.00%      33.922ms       4.073us       1.53 Mb      86.49 Kb     193.15 Mb      19.86 Mb          8329  
                                         aten::_to_copy         0.57%      28.289ms        24.71%        1.221s     160.289us       0.000us         0.00%      35.253ms       4.628us       1.60 Mb     -26.12 Kb     193.15 Mb           0 b          7618  
                                            aten::copy_         0.99%      48.753ms        23.01%        1.137s      89.709us      23.280ms         1.16%      54.509ms       4.299us     214.80 Kb     214.80 Kb           0 b           0 b         12679  
                                        cudaMemcpyAsync        20.57%        1.017s        20.57%        1.017s      94.364us      17.064ms         0.85%      17.064ms       1.584us           0 b           0 b           0 b           0 b         10775  
                                     is_finished_tolist        -0.82%  -40638.000us        19.40%     959.056ms       3.165ms       0.000us         0.00%     300.000us       0.990us           0 b    -374.33 Kb           0 b           0 b           303  

That is 303 calls, with a maximum tensor size of (960, 4) (most often far fewer than 960 rows).

The code calling is_finished_tolist is:

        self.topk_ids = self.topk_ids.eq(self.eos)
        with record_function("is_finished_tolist"):
            self.is_finished_list = self.topk_ids.tolist()

So I am trying to record ONLY the tolist() operation, and it takes 959 ms in total (about 3.2 ms per call).

Is it somehow a question of CUDA waiting for something else to finish?

Never mind, I fully understand now.
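
For readers with the same question: CUDA kernels are launched asynchronously, so a device-to-host copy such as tolist() has to wait for everything queued before it on the stream, and that wait gets billed to the copy. A rough way to separate the two, shown here as a sketch with a hypothetical helper named timed_tolist and an arbitrary matmul standing in for earlier work, is to synchronize explicitly before timing the copy.

```python
import time
import torch

def timed_tolist(t):
    """Return (values, seconds waiting for queued work, seconds in tolist)."""
    start = time.perf_counter()
    if t.is_cuda:
        # Wait for kernels queued earlier so they are not billed to tolist().
        torch.cuda.synchronize()
    waited = time.perf_counter() - start

    start = time.perf_counter()
    values = t.tolist()
    copied = time.perf_counter() - start
    return values, waited, copied

# Falls back to the CPU when no GPU is present (an assumption for portability).
device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(2048, 2048, device=device)
mask = (a @ a).gt(0)[:960, :4]  # stand-in for work queued before the copy
values, waited, copied = timed_tolist(mask)
print(f"waiting: {waited:.6f}s, tolist: {copied:.6f}s")
```

The extra synchronize stalls the pipeline, so it is meant for diagnosis only, not for production code.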