PyTorch tensor indexing consumes much more time when first called

Hi,

I found a time-consuming hot spot in my code: an indexing operation over a tensor that is slow the first time it is called. To narrow the question down, I ran the simple experiment below.

import torch
import time


def func_1():
    # CPU tensors
    a = torch.rand(10)
    indice = torch.tensor((1, 2, 3))
    for i in range(5):
        torch.cuda.synchronize()
        st = time.time()

        b = a[indice]

        torch.cuda.synchronize()
        ed = time.time()
        print((ed - st) * 1000)


def func_2():
    # same tensors, moved to the GPU
    a = torch.rand(10).cuda()
    indice = torch.tensor((1, 2, 3)).cuda()
    for i in range(5):
        torch.cuda.synchronize()
        st = time.time()

        b = a[indice]

        torch.cuda.synchronize()
        ed = time.time()
        print((ed - st) * 1000)


if __name__ == '__main__':
    for i in range(2):
        func_1()
        time.sleep(5)
        print("----")
    print("###############")
    for i in range(2):
        func_2()
        time.sleep(5)
        print("----")

The code produced the following output:

0.1461505889892578
0.01430511474609375
0.0095367431640625
0.008821487426757812
0.00858306884765625
----
0.10752677917480469
0.04220008850097656
0.03361701965332031
0.0324249267578125
0.03170967102050781
----
###############
3.1435489654541016
0.10132789611816406
0.053882598876953125
0.048160552978515625
0.04696846008300781
----
0.14138221740722656
0.06222724914550781
0.051975250244140625
0.0476837158203125
0.04649162292480469
----

It is obvious that the indexing operation needs more time when it is first called. And if you call it, then go off and do other things for a while, it costs more again once you come back. It is like a cup of coffee that needs to be kept warm continuously…

My environment: PyTorch 1.9.0 + py3.8_cuda11.1_cudnn8.0.5_0, device: NVIDIA GeForce RTX 3090.

I would appreciate help explaining this phenomenon so that I can optimize my code…

Thanks for your attention.

I cannot reproduce the issue and get quite noisy results for the GPU workload, e.g.:

0.0476837158203125
0.11849403381347656
0.06389617919921875
0.053882598876953125
0.05459785461425781
----
0.07605552673339844
0.15306472778320312
0.10085105895996094
0.08416175842285156
0.08344650268554688

Use torch.utils.benchmark to profile specific operations, as it adds warmup iterations and stabilizes the timing over multiple runs. Alternatively, use a profiler to check the actual kernel runtimes, and check whether your GPU drops into an idle state when it is not being used (which is why warmup iterations are used during profiling).
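
For reference, here is a minimal sketch of how the GPU indexing could be timed with torch.utils.benchmark (the tensor shapes are taken from the snippet above; the label strings are arbitrary). Timer adds warmup iterations and handles the CUDA synchronization for you:

import torch
import torch.utils.benchmark as benchmark

a = torch.rand(10, device='cuda')
indice = torch.tensor((1, 2, 3), device='cuda')

# Timer performs a warmup and synchronizes CUDA around the timed statement
t = benchmark.Timer(
    stmt='a[indice]',
    globals={'a': a, 'indice': indice},
    label='tensor indexing',
    sub_label='GPU',
)
print(t.blocked_autorange(min_run_time=1))

And a sketch using torch.profiler to inspect the actual kernel runtimes, reusing the tensors defined above (the sort key and row limit are just example choices):

from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):  # a few iterations, so the slow first call stands out from the rest
        b = a[indice]
    torch.cuda.synchronize()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))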