# Inference on GPU slows after a few iterations

I’ve run into a bit of a problem when trying to utilise a pre-trained model. I’ve written the model myself based around a UNet architecture.

When iterating inference on a GPU (Nvidia GeForce GTX 750 Ti), I find that it infers incredibly quickly for the first ~25 iterations, then slows dramatically for the remaining iterations.

For example, running the following:

``````
from convDiff_model import convDiff
import torch
import time

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

net = convDiff()
net.to(device)

ran = torch.rand(1, 2, 256, 256).to(device)

tic = time.process_time()
for i in range(48):
    net(ran.float())
    if i % 5 == 4:
        toc = time.process_time()
        print('Iter. %2d to %2d: Mean time: %.3f' % (i - 4, i + 1, (toc - tic) / 5.))
        tic = time.process_time()
``````

produces the output:

``````
Iter.  0 to  5: Mean time: 0.004
Iter.  5 to 10: Mean time: 0.002
Iter. 10 to 15: Mean time: 0.002
Iter. 15 to 20: Mean time: 0.002
Iter. 20 to 25: Mean time: 0.084
Iter. 25 to 30: Mean time: 0.442
Iter. 30 to 35: Mean time: 0.442
Iter. 35 to 40: Mean time: 0.442
Iter. 40 to 45: Mean time: 0.444
``````

so you can see that there’s a ~200x slowdown between the first twenty iterations and the last twenty. Oddly, including or removing `with torch.no_grad():` makes no difference. Inference stays fast if I call `torch.cuda.empty_cache()` every, say, 15 iterations, but that call itself takes about 0.45 seconds, which rules it out as a working solution.
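
Roughly, the `empty_cache()` workaround looks like this (just a sketch of what I mean, reusing `net` and `ran` from the snippet above):

``````
with torch.no_grad():
    for i in range(48):
        net(ran.float())
        if i % 15 == 14:
            # Clearing the cache every ~15 iterations keeps the later iterations
            # fast, but the call itself takes ~0.45 s, so it doesn't help overall.
            torch.cuda.empty_cache()
``````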

[To clarify: in the above code snippet I’ve removed the loading of the trained model, as this particular behaviour doesn’t seem to depend on the weights.]

The full model architecture is available at this GitHub repo: https://github.com/jrmullaney/ConvDiff/tree/master/code

Is this behaviour normal? I’d have thought each iteration would be independent.

Apologies if this has already been addressed in another post; I had a good look round but couldn’t find any solutions.

CUDA operations are asynchronous, so you would need to synchronize the code before starting and stopping the timer via `torch.cuda.synchronize()`.
Currently the first iterations are most likely just launching the kernels (the calls return immediately), and the code will then block in the `net(ran)` line after a few iterations once the launch queue is full.
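
Something along these lines should give more representative per-iteration numbers (a minimal sketch, reusing `net` and `ran` from your snippet):

``````
import time
import torch

with torch.no_grad():
    torch.cuda.synchronize()          # make sure no work is still pending
    tic = time.process_time()
    for i in range(48):
        net(ran.float())
        torch.cuda.synchronize()      # wait for the GPU before reading the clock
        toc = time.process_time()
        print('Iter. %2d : time: %.3f' % (i, toc - tic))
        tic = time.process_time()
``````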

Thanks, @ptrblck, for the very quick response.

Indeed, when I add `torch.cuda.synchronize()` in the following location:

``````
with torch.no_grad():
    tic = time.process_time()
    for i in range(48):
        net(ran.float())
        if i % 5 == 4:
            toc = time.process_time()
            print('Iter. %2d to %2d: Mean time: %.3f' % (i - 4, i + 1, (toc - tic) / 5.))
            torch.cuda.synchronize()
            tic = time.process_time()
``````

I get:

``````
Iter.  0 to  5: Mean time: 0.005
Iter.  5 to 10: Mean time: 0.002
Iter. 10 to 15: Mean time: 0.002
Iter. 15 to 20: Mean time: 0.003
Iter. 20 to 25: Mean time: 0.002
Iter. 25 to 30: Mean time: 0.002
Iter. 30 to 35: Mean time: 0.002
Iter. 35 to 40: Mean time: 0.002
Iter. 40 to 45: Mean time: 0.002
``````

but the wall-clock time between each of those printed lines is a good 6-8 seconds. So I guess each iteration is getting pushed to the GPU quickly, but a backlog then forms after a few iterations (if I’ve interpreted your reply correctly; I’m new to PyTorch and CUDA).

Thanks for clearing that up. Presumably I’ll have to tweak my architecture if I want major speedups?

The `toc` counter isn’t synchronized, so while the GPU is still busy in `net(ran.float())` you immediately stop the timer, print the profiling result, synchronize (and thus wait until the GPU has finished), restart the timer, and stop it again right after queueing the next few iterations.
Try adding another `torch.cuda.synchronize()` before the `toc` call and rerun the code.

Thanks again,

With the following:

``````
with torch.no_grad():
    tic = time.process_time()
    for i in range(48):
        net(ran.float())
        if i % 5 == 4:
            torch.cuda.synchronize()
            toc = time.process_time()
            print('Iter. %2d to %2d: Mean time: %.3f' % (i - 4, i + 1, (toc - tic) / 5.))
            torch.cuda.synchronize()
            tic = time.process_time()
``````

[is that what you meant?], I get:

``````
Iter.  0 to  5: Mean time: 1.782
Iter.  5 to 10: Mean time: 1.778
Iter. 10 to 15: Mean time: 1.781
Iter. 15 to 20: Mean time: 1.780
Iter. 20 to 25: Mean time: 1.783
Iter. 25 to 30: Mean time: 1.782
Iter. 30 to 35: Mean time: 1.782
Iter. 35 to 40: Mean time: 1.779
Iter. 40 to 45: Mean time: 1.779
``````

So the 48 iterations take about 80-90 seconds (wall time) in total.

Is that what you’d expect?

Thanks,

Oh, and I should say: I’m not hung up on the tics and tocs; those were just for demonstration purposes, to profile how the time per iteration changes throughout the loop. The important thing is the total wall time to process a batch of images.

In production, I’d like this to infer around 100 512x512 images at a time. Sub-1s in total (i.e. 100 x 0.002s, as in the first few iterations in my first post) would be great, and even sub-10s would be fine, but ~45s is not really viable.

The new timing should reflect the time for 5 iterations.
If you scale that up, it should match the processing time for the complete dataset.

I might be misunderstanding your profiling, but you seem to be accumulating the time over 5 iterations, which would mean the whole run finishes in `1.78/5*48 ~= 17s`.

Ah, no - that’s not the accumulation over 5 iterations, it’s the average over five. With the addition of the `synchronize()` step, each printed value now corresponds to:
( t0 + t1 + t2 + t3 + t4 + synchronize() ) / 5
where t0, t1, etc. are the times taken by the individual `net()` calls. If I instead time each iteration on its own, with:

``````
with torch.no_grad():
    for i in range(48):
        tic = time.process_time()
        net(ran.float())
        toc = time.process_time()
        print('Iter. %2d : time: %.3f' % (i, toc - tic))
``````

I get:

``````
Iter.  0 : time: 0.019
Iter.  1 : time: 0.002
Iter.  2 : time: 0.002
Iter.  3 : time: 0.002
.....
Iter. 21 : time: 0.002
Iter. 22 : time: 0.002
Iter. 23 : time: 0.094
Iter. 24 : time: 1.777
Iter. 25 : time: 1.743
Iter. 26 : time: 1.781
.....
Iter. 44 : time: 1.750
Iter. 45 : time: 1.775
Iter. 46 : time: 1.779
Iter. 47 : time: 1.774
``````

where the `.....` correspond to lines showing virtually the same times (I’ve removed the `synchronize()` calls to give a truer indication of the time per inference).

With those times, the total compute time for 100 images would be:
0.002 x 24 + 1.77 x 76 ≈ 135s

Thanks for the correction. I missed the average in the calculation.

You could set `torch.backends.cudnn.benchmark = True` at the beginning of your script to use the cudnn heuristics in order to find the fastest kernels for your current workload.
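
A minimal sketch of what that would look like (assuming the same `convDiff`, `device` and input shape as in your first snippet):

``````
import torch
from convDiff_model import convDiff

# Ask cuDNN to benchmark the available algorithms for the first input shape it
# sees and cache the fastest one; most useful when the input shape stays constant.
torch.backends.cudnn.benchmark = True

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
net = convDiff().to(device)
ran = torch.rand(1, 2, 256, 256).to(device)

with torch.no_grad():
    out = net(ran.float())
``````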

Additionally, you could check out automatic mixed precision (available in master and the nightly binaries), especially if you are using a newer GPU with TensorCores (which unfortunately means it shouldn’t help much with the GTX 750).
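
For reference, inference under autocast would look roughly like this (a sketch based on the `torch.cuda.amp` API in the nightlies at the time of writing; the exact namespace may differ in your build):

``````
import torch

with torch.no_grad():
    # Run the forward pass in mixed precision where autocast considers it safe.
    with torch.cuda.amp.autocast():
        out = net(ran.float())
``````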

OK - I’ll give those a try.

Thanks for your help and speedy replies! Very much appreciated.