Hi, I was training my model and noticed something strange. I rewrote my code to make it easier to read; see the code below:

```python
import time
import torch

matched = torch.LongTensor(8732).to("cuda")
for epoch in range(config.start_epoch, config.end_epoch):

    imgs = imgs.to(config.device)
    targets = [target.to(config.device) for target in targets]

    loc, conf = net(imgs)  # forward pass

    for i in range(5):
        t = time.time()
        matched[0] = -1  # assign a single element
        print(time.time() - t)

    print("finished 1 iteration!")
```

I create the `matched` tensor and pass the images through my neural network. After that, I measure how long it takes to assign a single element of `matched`:

```
finished 1 iteration!
0.20601940155029297
0.0
0.0
0.0
0.006075859069824219
finished 1 iteration!
0.21062469482421875
0.0
0.0
0.0
0.0
finished 1 iteration!
```

You can see that the first assignment to `matched[0]` in each iteration takes much longer than the others. But when I don't pass the images through the neural network, it takes much less time:

```python
import time
import torch

matched = torch.LongTensor(8732).to("cuda")
for epoch in range(config.start_epoch, config.end_epoch):

    imgs = imgs.to(config.device)
    targets = [target.to(config.device) for target in targets]

    # loc, conf = net(imgs)  # images are not passed through the network

    for i in range(5):
        t = time.time()
        matched[0] = -1  # assign a single element
        print(time.time() - t)

    print("finished 1 iteration!")
```

The result is:

```
0.0
0.0
0.0010001659393310547
0.0
0.0
finished 1 iteration!
0.0
0.0
0.0
0.0
0.0
finished 1 iteration!
```

I don't understand why. Please let me know the reason.

Note: `config.device = "cuda"` in my code.

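CUDA operations are executed asynchronously, so you need to synchronize via `torch.cuda.synchronize()` before starting and before stopping the timers. Otherwise you measure dispatching, kernel launches, and implicit synchronizations rather than the operation itself. In your first snippet, the first timed assignment likely includes time spent waiting on work that `net(imgs)` had already queued on the GPU.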
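The synchronized timing pattern could look something like the sketch below. It is only an illustration under stated assumptions: the helper names `timed` and `assign` are hypothetical, the matrix multiply is a stand-in for your `net(imgs)` call, and the script falls back to the CPU when no GPU is available so that it still runs.

```python
import time
import torch


def timed(fn, device):
    """Run fn() and return (result, elapsed seconds).

    On CUDA, synchronize before starting and before stopping the timer,
    so that only the work launched by fn() is measured.
    """
    if device.type == "cuda":
        torch.cuda.synchronize(device)  # wait for previously queued kernels
    start = time.perf_counter()
    result = fn()
    if device.type == "cuda":
        torch.cuda.synchronize(device)  # wait for the work launched by fn()
    return result, time.perf_counter() - start


# Assumption: fall back to CPU so the sketch runs without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
matched = torch.zeros(8732, dtype=torch.long, device=device)
x = torch.randn(1024, 1024, device=device)


def assign():
    # The operation being timed: a single-element assignment.
    matched[0] = -1


for _ in range(3):
    y = x @ x  # hypothetical stand-in for net(imgs); queued asynchronously on CUDA
    _, elapsed = timed(assign, device)
    print(f"assignment took {elapsed:.6f} s")
```

With the synchronization before the timer starts, the pending work from the stand-in forward pass should be finished before the measurement begins, so the first assignment should no longer stand out from the later ones.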

It has been very helpful to me, thank you