Hi, I have a problem with a neural network model.
I have a trained CNN model and I need to use it repeatedly. Here is the code:
self.CNN.eval()
with torch.no_grad():
    for i in range(self.TestSteps):
        aa = time.time()
        initial = self.CNN(initial)
        bb = time.time()
        print(bb - aa)
The computational cost of each loop iteration is about 0.002s, but after a few iterations it increases to about 0.3s:
0.0029914379119873047
0.0009975433349609375
0.0019941329956054688
0.0025336742401123047
0.0010161399841308594
0.001994609832763672
0.0009975433349609375
0.003017902374267578
0.001968860626220703
0.0015769004821777344
0.001043081283569336
0.002002239227294922
0.17526578903198242
0.3638777732849121
0.35142970085144043
0.3363661766052246
0.3505854606628418
0.3499774932861328
0.41155457496643066
0.35362887382507324
0.3807251453399658
Could anyone help me figure out this problem? I really appreciate your help.
If you are using your GPU for this model execution you would need to synchronize the code before starting and stopping the host timers since CUDA kernels are executed asynchronously.
Is it correct if I rewrite the code like this?
self.CNN.eval()
with torch.no_grad():
    for i in range(self.TestSteps):
        torch.cuda.synchronize()
        aa = time.time()
        initial = self.CNN(initial)
        torch.cuda.synchronize()
        bb = time.time()
        print(bb - aa)
However, with torch.cuda.synchronize() the computation time is now always around 0.3s:
0.34552931785583496
0.35210347175598145
0.37857532501220703
0.3551297187805176
0.3482997417449951
0.34366416931152344
0.3459787368774414
0.3441014289855957
Your code is not properly formatted, but looks correct from what I can see.
Yes, your previous profiling was invalid and was not profiling the kernel execution time.
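If you want to see what is actually happening on the device, you could also run a quick check with the PyTorch profiler. A minimal sketch, reusing the names from your snippet (self.CNN and initial):

from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        initial = self.CNN(initial)
# per-kernel timings, sorted by total time spent on the GPU
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))

The table separates the CPU-side launch overhead from the actual CUDA kernel times, which is exactly the distinction your host timers were missing.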
Thanks for the explanation. Just to confirm — does this mean the actual GPU runtime for each CNN model inference is around 0.3 seconds, rather than the 0.002 seconds I initially observed? I’m curious why it can be that fast initially but cannot consistently maintain that speed. Since I’m working on improving efficiency, I’m hoping to achieve a meaningful speedup by leveraging the trained CNN model.
Additionally, I noticed that when I manually release CUDA memory with torch.cuda.empty_cache(), the inference time remains consistently around 0.002 seconds, but only when I don't call torch.cuda.synchronize():
self.CNN.eval()
with torch.no_grad():
    for i in range(self.TestSteps):
        aa = time.time()
        initial = self.CNN(initial)
        bb = time.time()
        print(bb - aa)
        torch.cuda.empty_cache()
However, if I include torch.cuda.synchronize(), the measured time still stays around 0.3 seconds.
Yes, since your initial profiling is wrong and is not measuring the GPU kernel execution time.
The kernels were never as fast as your profile showed, since you were not waiting for the GPU to complete its execution. Instead, you were measuring the Python/C++/backend overhead of launching the kernel(s), as well as implicit synchronizations.
As mentioned before: CUDA kernels are executed asynchronously, so you would need to synchronize before starting and stopping the host timers, or you could use CUDA events to measure the execution time on the device (see the sketch below).
If you want to measure the GPU kernel execution time with host timers you need to synchronize the code. Otherwise your profiling is wrong and misleading.
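For reference, a minimal sketch of the event-based timing, adapted to your loop:

self.CNN.eval()
with torch.no_grad():
    for i in range(self.TestSteps):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        initial = self.CNN(initial)
        end.record()
        # wait for the work recorded between the two events to finish
        torch.cuda.synchronize()
        print(start.elapsed_time(end))  # milliseconds

This measures the elapsed time between the two events on the GPU itself, so the launch overhead on the host no longer distorts the numbers.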
Thank you so much for the detailed explanation — that makes perfect sense now.