Hi all,

I ran into an interesting problem when deploying my model. I am doing segmentation on 3D CT images with a 2D U-Net, so to get the full 3D segmentation result the network has to run slice by slice. The inference BATCH_SIZE is set to 2, which means two adjacent slices are extracted and run through the model at the same time. If a CT image has an odd number of slices, the last batch has only one slice as input. The problem is that the first and the last inference take much longer than the average.

Environment:

python=3.6

pytorch=1.0.1

torch.backends.cudnn.benchmark = True

Code is like the following:

```
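import time
import torch

BATCH_SIZE = 2
# img: a 9*1*512*512 float CUDA tensor holding 9 CT slices (defined elsewhere)
# model: the 2D U-Net, already on the GPU (defined elsewhere)

# Full batches of BATCH_SIZE slices
for i in range(img.shape[0] // BATCH_SIZE):
    net_input = img[BATCH_SIZE*i:BATCH_SIZE*(i+1), :, :, :]
    torch.cuda.synchronize()
    start = time.time()
    output = model(net_input)
    torch.cuda.synchronize()  # wait for the GPU before stopping the timer
    inference_time = time.time() - start
    print('inference time: %.5f' % inference_time)

# Leftover slices when the slice count is not a multiple of BATCH_SIZE
if img.shape[0] % BATCH_SIZE > 0:
    net_input = img[BATCH_SIZE*(i+1):, :, :, :]
    torch.cuda.synchronize()
    start = time.time()
    output = model(net_input)
    torch.cuda.synchronize()  # added so this timing matches the loop above
    inference_time = time.time() - start
    print('inference time: %.5f' % inference_time)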
```

The inference times are as follows:

inference time: 1.30878 # batch size 2

inference time: 0.02359 # batch size 2

inference time: 0.02301 # batch size 2

inference time: 0.02312 # batch size 2

inference time: 1.30119 # batch size 1

The first and the last inference are much slower than the others. I can reproduce this whenever the batch size changes. Does anyone know why?
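
Since the slowdown seems tied to the batch size changing, one experiment I could try is keeping every batch the same shape by padding the last one with a repeated slice. Below is a minimal sketch of the index bookkeeping only. The helper name `fixed_size_batches` is my own and not part of PyTorch, and I have not confirmed that this removes the slow last inference.

```python
def fixed_size_batches(num_slices, batch_size):
    """Split range(num_slices) into index lists that all have batch_size entries.

    The last batch is padded by repeating its final slice index, so the
    input shape never changes. Each entry is (indices, num_valid), where
    num_valid is how many indices are real slices rather than padding.
    """
    batches = []
    for start in range(0, num_slices, batch_size):
        indices = list(range(start, min(start + batch_size, num_slices)))
        num_valid = len(indices)
        # Repeat the last real index until the batch is full.
        indices += [indices[-1]] * (batch_size - num_valid)
        batches.append((indices, num_valid))
    return batches


# 9 slices with a batch size of 2, as in the post
for indices, num_valid in fixed_size_batches(9, 2):
    print(indices, num_valid)
```

In the timing loop this would be used as `net_input = img[indices]` followed by `output = model(net_input)[:num_valid]`, so the padded duplicate is dropped from the result.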