Hi,
I found a time-consuming hot spot in my code.
It is an indexing operation over a tensor, and it is especially time-consuming on the first call.
To simplify the question, I ran the following simple experiment:
import torch
import time

def func_1():
    a = torch.rand(10)
    indice = torch.tensor((1, 2, 3))
    for i in range(5):
        torch.cuda.synchronize()
        st = time.time()
        b = a[indice]
        torch.cuda.synchronize()
        ed = time.time()
        print((ed - st) * 1000)

def func_2():
    a = torch.rand(10).cuda()
    indice = torch.tensor((1, 2, 3)).cuda()
    for i in range(5):
        torch.cuda.synchronize()
        st = time.time()
        b = a[indice]
        torch.cuda.synchronize()
        ed = time.time()
        print((ed - st) * 1000)

if __name__ == '__main__':
    for i in range(2):
        func_1()
        time.sleep(5)
        print("----")
    print("###############")
    for i in range(2):
        func_2()
        time.sleep(5)
        print("----")
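As an aside on methodology: I also considered measuring with torch.cuda.Event instead of time.time(), since events are recorded on the CUDA stream itself and should avoid some host-side timing noise. A minimal sketch of what I mean (index_ms is just a helper name I made up; it falls back to wall-clock timing on CPU):

```python
import time

import torch

def index_ms(a, indice):
    # Hypothetical helper: run b = a[indice] and return (result, milliseconds).
    if a.is_cuda:
        # CUDA events are recorded on the stream, so the measured interval
        # reflects device-side execution rather than host call overhead.
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        b = a[indice]
        end.record()
        torch.cuda.synchronize()  # wait until both events have been recorded
        return b, start.elapsed_time(end)
    # CPU fallback: plain wall-clock timing
    st = time.time()
    b = a[indice]
    return b, (time.time() - st) * 1000

a = torch.arange(10.0)
b, ms = index_ms(a, torch.tensor((1, 2, 3)))
print(b.tolist())  # [1.0, 2.0, 3.0]
```

I am not sure this changes the conclusion, since the first-call gap is so large either way, but it may give cleaner numbers for the GPU case.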
The code produced the following output:
0.1461505889892578
0.01430511474609375
0.0095367431640625
0.008821487426757812
0.00858306884765625
----
0.10752677917480469
0.04220008850097656
0.03361701965332031
0.0324249267578125
0.03170967102050781
----
###############
3.1435489654541016
0.10132789611816406
0.053882598876953125
0.048160552978515625
0.04696846008300781
----
0.14138221740722656
0.06222724914550781
0.051975250244140625
0.0476837158203125
0.04649162292480469
----
Clearly, the indexing operation needs more time when first called. And if you call it, go off to do other things for a while, and then come back, it costs more again. It is like a cup of coffee that needs continuous reheating…
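The only workaround I have come up with so far is to warm the op up with a throwaway call right before the timed one, on the assumption that the first-call cost is some one-time setup (kernel loading?) that the extra call can absorb. A sketch (timed_index is just a helper name I made up):

```python
import time

import torch

def timed_index(a, indice, warmup=True):
    # Hypothetical helper: optionally run the op once, untimed, first.
    if warmup:
        _ = a[indice]  # throwaway call to absorb any first-call setup cost
    if a.is_cuda:
        torch.cuda.synchronize()
    st = time.time()
    b = a[indice]
    if a.is_cuda:
        torch.cuda.synchronize()
    return b, (time.time() - st) * 1000

a = torch.rand(10)
b, ms = timed_index(a, torch.tensor((1, 2, 3)))
print(round(ms, 3), "ms for the warmed-up call")
```

But I do not know whether warming up is the right fix, or whether it even helps after the program has been idle for a while, which is exactly my question.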
My environment: pytorch 1.9.0 + py3.8_cuda11.1_cudnn8.0.5_0, device: NVIDIA GeForce RTX 3090.
I would appreciate help explaining this phenomenon and then optimizing my code…
Thanks for your attention.