# Inference on GPU slows after a few iterations

I’ve run into a bit of a problem when trying to utilise a pre-trained model. I’ve written the model myself based around a UNet architecture.

When iterating inference on a GPU (Nvidia GeForce GTX 750 Ti), I find that it infers incredibly quickly for the first ~25 iterations, then slows dramatically for the remaining iterations.

For example, running the following:

``````
from convDiff_model import convDiff
import torch
import time

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

net = convDiff()
net.to(device)

ran = torch.rand(1, 2, 256, 256).to(device)

tic = time.process_time()
for i in range(48):
    net(ran.float())
    if i % 5 == 4:
        toc = time.process_time()
        print('Iter. %2d to %2d: Mean time: %.3f' % (i - 4, i + 1, (toc - tic) / 5.))
        tic = time.process_time()
``````

produces the output:

``````
Iter.  0 to  5: Mean time: 0.004
Iter.  5 to 10: Mean time: 0.002
Iter. 10 to 15: Mean time: 0.002
Iter. 15 to 20: Mean time: 0.002
Iter. 20 to 25: Mean time: 0.084
Iter. 25 to 30: Mean time: 0.442
Iter. 30 to 35: Mean time: 0.442
Iter. 35 to 40: Mean time: 0.442
Iter. 40 to 45: Mean time: 0.444
``````

so you can see that there’s a ~200x slowdown between the first twenty iterations and the last twenty. Oddly, including or removing `with torch.no_grad():` makes no difference. Inference stays fast if I call `torch.cuda.empty_cache()` every, say, 15 iterations, but that call itself takes about 0.45 seconds, which rules it out as a working solution.
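
Roughly, the `empty_cache()` workaround looks like this (just a sketch of what I mean, reusing `net` and `ran` from the snippet above):

``````
with torch.no_grad():
    for i in range(48):
        net(ran.float())
        if i % 15 == 14:
            # Clearing the cache every ~15 iterations keeps the later iterations
            # fast, but the call itself takes ~0.45 s, so it doesn't help overall.
            torch.cuda.empty_cache()
``````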

[To clarify: in the above code snippet I’ve removed the loading of the trained model, as this particular behaviour doesn’t seem to depend on the weights.]

The full model architecture is available at this GitHub repo: https://github.com/jrmullaney/ConvDiff/tree/master/code

Is this behaviour normal? I’d have thought each iteration would be independent.

Apologies if this has already been addressed in another post; I had a good look round but couldn’t find any solutions.

CUDA operations are asynchronous, so you would need to synchronize the code before starting and stopping the timer via `torch.cuda.synchronize()`.
Currently the first iterations are most likely just launching the kernels (the calls return immediately), and the code will then block in the `net(ran)` line after a few iterations once the launch queue is full.
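
Something along these lines should give more representative per-iteration numbers (a minimal sketch, reusing `net` and `ran` from your snippet):

``````
import time
import torch

with torch.no_grad():
    torch.cuda.synchronize()          # make sure no work is still pending
    tic = time.process_time()
    for i in range(48):
        net(ran.float())
        torch.cuda.synchronize()      # wait for the GPU before reading the clock
        toc = time.process_time()
        print('Iter. %2d : time: %.3f' % (i, toc - tic))
        tic = time.process_time()
``````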

Thanks, @ptrblck, for the very quick response.

Indeed, when I add `torch.cuda.synchronize()` in the following location:

``````
with torch.no_grad():
    tic = time.process_time()
    for i in range(48):
        net(ran.float())
        if i % 5 == 4:
            toc = time.process_time()
            print('Iter. %2d to %2d: Mean time: %.3f' % (i - 4, i + 1, (toc - tic) / 5.))
            torch.cuda.synchronize()
            tic = time.process_time()
``````

I get:

``````
Iter.  0 to  5: Mean time: 0.005
Iter.  5 to 10: Mean time: 0.002
Iter. 10 to 15: Mean time: 0.002
Iter. 15 to 20: Mean time: 0.003
Iter. 20 to 25: Mean time: 0.002
Iter. 25 to 30: Mean time: 0.002
Iter. 30 to 35: Mean time: 0.002
Iter. 35 to 40: Mean time: 0.002
Iter. 40 to 45: Mean time: 0.002
``````

but the wall-clock time between each of those printed lines is a good 6-8 seconds. So I guess each iteration is getting pushed to the GPU quickly, but a backlog then forms after a few iterations (if I’ve interpreted your reply correctly; I’m new to PyTorch and CUDA).

Thanks for clearing that up. Presumably I’ll have to tweak my architecture if I want major speedups?

The `toc` counter isn’t synchronized, so while the GPU is still busy in `net(ran.float())` you immediately stop the timer, print the profiling result, synchronize (and thus wait until the GPU has finished), restart the timer, and stop it again right after queueing the next few iterations.
Try adding another `torch.cuda.synchronize()` before the `toc` call and rerun the code.

Thanks again,

With the following:

``````
with torch.no_grad():
    tic = time.process_time()
    for i in range(48):
        net(ran.float())
        if i % 5 == 4:
            torch.cuda.synchronize()
            toc = time.process_time()
            print('Iter. %2d to %2d: Mean time: %.3f' % (i - 4, i + 1, (toc - tic) / 5.))
            torch.cuda.synchronize()
            tic = time.process_time()
``````

[is that what you meant?], I get:

``````
Iter.  0 to  5: Mean time: 1.782
Iter.  5 to 10: Mean time: 1.778
Iter. 10 to 15: Mean time: 1.781
Iter. 15 to 20: Mean time: 1.780
Iter. 20 to 25: Mean time: 1.783
Iter. 25 to 30: Mean time: 1.782
Iter. 30 to 35: Mean time: 1.782
Iter. 35 to 40: Mean time: 1.779
Iter. 40 to 45: Mean time: 1.779
``````

So the 48 iterations take about 80-90 seconds (wall time) in total.

Is that what you’d expect?

Thanks,

Oh, and I should say: I’m not hung up on the tics and tocs; those were just for demonstration purposes, to profile how the time per iteration changes throughout the loop. The important thing is the total wall time to process a batch of images.

In production, I’d like this to infer around 100 512x512 images at a time. Sub-1s in total (i.e. 100 x 0.002s, as in the first few iterations in my first post) would be great, and even sub-10s would be fine, but ~45s is not really viable.

The new timing should reflect the time for 5 iterations.
If you scale that up, it should match the processing time for the complete dataset.

I might be misunderstanding your profiling, but you seem to be accumulating the time over 5 iterations, which would mean the whole run finishes in `1.78/5*48 ~= 17s`.

Ah, no - that’s not the accumulation over 5 iterations, it’s the average over five. With the addition of the `synchronize()` step, each printed value now corresponds to:
( t0 + t1 + t2 + t3 + t4 + synchronize() ) / 5
where t0, t1, etc. are the times taken by the individual `net()` calls. If I instead time each iteration on its own, with:

``````
with torch.no_grad():
    for i in range(48):
        tic = time.process_time()
        net(ran.float())
        toc = time.process_time()
        print('Iter. %2d : time: %.3f' % (i, toc - tic))
``````

I get:

``````
Iter.  0 : time: 0.019
Iter.  1 : time: 0.002
Iter.  2 : time: 0.002
Iter.  3 : time: 0.002
.....
Iter. 21 : time: 0.002
Iter. 22 : time: 0.002
Iter. 23 : time: 0.094
Iter. 24 : time: 1.777
Iter. 25 : time: 1.743
Iter. 26 : time: 1.781
.....
Iter. 44 : time: 1.750
Iter. 45 : time: 1.775
Iter. 46 : time: 1.779
Iter. 47 : time: 1.774
``````

where the `.....` correspond to lines showing virtually the same times (I’ve removed the `synchronize()` calls to give a truer indication of the time per inference).

With those times, the total compute time for 100 images would be:
0.002 x 24 + 1.77 x 76 ≈ 135s

Thanks for the correction. I missed the average in the calculation.

You could set `torch.backends.cudnn.benchmark = True` at the beginning of your script to use the cudnn heuristics in order to find the fastest kernels for your current workload.
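
A minimal sketch of what that would look like (assuming the same `convDiff`, `device` and input shape as in your first snippet):

``````
import torch
from convDiff_model import convDiff

# Ask cuDNN to benchmark the available algorithms for the first input shape it
# sees and cache the fastest one; most useful when the input shape stays constant.
torch.backends.cudnn.benchmark = True

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
net = convDiff().to(device)
ran = torch.rand(1, 2, 256, 256).to(device)

with torch.no_grad():
    out = net(ran.float())
``````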

Additionally, you could check out automatic mixed precision (available in master and the nightly binaries), especially if you are using a newer GPU with TensorCores (which unfortunately means it shouldn’t help much with the GTX 750).
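
For reference, inference under autocast would look roughly like this (a sketch based on the `torch.cuda.amp` API in the nightlies at the time of writing; the exact namespace may differ in your build):

``````
import torch

with torch.no_grad():
    # Run the forward pass in mixed precision where autocast considers it safe.
    with torch.cuda.amp.autocast():
        out = net(ran.float())
``````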

OK - I’ll give those a try.

Thanks for your help and speedy replies! Very much appreciated.