Model inference very slow when batch_size changes for the first time

(claude) #1

Hi all,
I ran into an interesting problem when deploying my model. I am doing segmentation on 3D CT images with a 2D U-Net, so to get the full 3D segmentation the network has to run slice by slice. The inference BATCH_SIZE is set to 2, which means two adjacent slices are extracted and run through the model at the same time. If a CT image has an odd number of slices, the last batch has only one slice as input. The problem is that the first and the last inference take much longer than the average.

torch.backends.cudnn.benchmark = True

The code looks like this:

img        # a 9*1*512*512 float cuda tensor, for 9 slice of CT
for i in range(img.shape[0] // BATCH_SIZE):
    net_input = img[BATCH_SIZE*i:BATCH_SIZE*(i+1), :, :, :]
    start = time.time()
    output = model(net_input)
    inference_time = time.time() - start
    print('inference time: %.5f'%inference_time)

if img.shape[0] % BATCH_SIZE > 0:
    net_input = img[BATCH_SIZE*(i+1):, :, :, :]
    start = time.time()
    output = model(net_input)
    inference_time = time.time() - start
    print('inference time: %.5f' %inference_time)

The inference times are as follows:
inference time: 1.30878 # batch size 2
inference time: 0.02359 # batch size 2
inference time: 0.02301 # batch size 2
inference time: 0.02312 # batch size 2
inference time: 1.30119 # batch size 1

The first and the last inference are much slower than the others. The problem can be reproduced every time the batch size changes. Does anyone know why?


This might be due to cuDNN's internal benchmarking, which is enabled if you set torch.backends.cudnn.benchmark = True.
If your input size changes, cuDNN will run different algorithms and benchmark them to choose the most performant one for this particular shape.
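
If the set of batch sizes is known in advance (here 1 and 2), one way to keep the benchmarking cost out of the real inference loop might be a warm-up pass per batch size before processing any data. This is only a rough sketch: warm_up is a hypothetical helper, and the small dilated Conv2d is a stand-in for the actual U-Net.

```python
import torch
import torch.nn as nn

torch.backends.cudnn.benchmark = True  # has no effect on CPU-only machines


def warm_up(model, batch_sizes, sample_shape, device):
    """Run one throwaway forward pass per expected batch size (hypothetical helper)."""
    model.eval()
    with torch.no_grad():
        for bs in batch_sizes:
            dummy = torch.zeros((bs, *sample_shape), device=device)
            model(dummy)
    if device.type == "cuda":
        torch.cuda.synchronize()  # make sure the warm-up kernels have finished


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Stand-in for the real network (assumption): a single dilated convolution.
model = nn.Conv2d(1, 4, kernel_size=3, padding=2, dilation=2).to(device)
warm_up(model, batch_sizes=(1, 2), sample_shape=(1, 64, 64), device=device)
```

The warm-up inputs should have the same spatial size as the real slices, since the benchmarking is done per input shape.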

Also note that you are missing a synchronization in the last part of your script.
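
On the synchronization point: CUDA kernels are launched asynchronously, so time.time() around model(net_input) can stop the clock before the GPU has finished. A minimal sketch of a timing wrapper is below; timed_forward is a made-up name, not something from the original script.

```python
import time

import torch


def timed_forward(model, net_input):
    """Return (output, seconds), waiting for GPU work before reading the clock."""
    if net_input.is_cuda:
        torch.cuda.synchronize()  # finish pending kernels before starting the clock
    start = time.perf_counter()
    with torch.no_grad():
        output = model(net_input)
    if net_input.is_cuda:
        torch.cuda.synchronize()  # wait until the forward pass has really finished
    return output, time.perf_counter() - start
```

Used in both the main loop and the tail batch, this would give comparable numbers for every call.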

(claude) #3

Thanks for the reply. The missing synchronization in the last part is a typo; the actual script is correct. Sorry for the mistake.

When I change torch.backends.cudnn.benchmark to False, the inference time is the same for all batches. However, another problem occurs: all batches get much slower, because there are dilated convolutions in my model. According to this issue, setting torch.backends.cudnn.benchmark to True solves that, so the two problems conflict. Any idea how to solve it?

Thanks a lot.


There is also a “hacky” solution you could use.

You could use a fixed batch size and initialize the whole batch with zeros.
Then you add the input data to the first few indices of the batch and process it.
Finally, you only use the indices of the batch which are valid.
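
A rough sketch of that padding approach, assuming the model is in eval mode so the zero-filled entries do not influence the valid ones (for example through batch norm statistics). predict_fixed_batch is a hypothetical helper name.

```python
import torch


def predict_fixed_batch(model, img, batch_size):
    """Run model over img (N, C, H, W) so that every call sees the same batch size."""
    outputs = []
    with torch.no_grad():
        for start in range(0, img.shape[0], batch_size):
            chunk = img[start:start + batch_size]
            n_valid = chunk.shape[0]
            if n_valid < batch_size:
                # Last, incomplete batch: pad with zeros up to the fixed size.
                batch = torch.zeros(
                    (batch_size, *img.shape[1:]), dtype=img.dtype, device=img.device
                )
                batch[:n_valid] = chunk
            else:
                batch = chunk
            out = model(batch)
            outputs.append(out[:n_valid])  # keep only the valid indices
    return torch.cat(outputs, dim=0)
```

With 9 slices and a batch size of 2, the model is called five times with a batch of 2, and the padded output of the last call is dropped.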

(claude) #5

That’s a feasible solution, although not very elegant. I am wondering, is it possible to manually set the convolution algorithm in cuDNN?