TorchScript object detection model is slower

I have compiled an SSD-based object detection model in PyTorch with torch.jit.script(model) and benchmarked the scripted and original models on a Tesla K80 GPU (AWS p2 instance). The scripted model appears to be slower than the original model.

Averaged over 100 images:
Original model: 0.1787 seconds per image
Scripted model: 0.1928 seconds per image

I also benchmarked a ResNet50 model and got a similar slow-down:
Original ResNet50: 0.0281 seconds per image
Scripted ResNet50: 0.0303 seconds per image

I was expecting some speed-up and am disappointed by the slow-down.
Is this normal, or could I have missed something?

Could you share the code you used for profiling?
Note that CUDA operations are executed asynchronously, so you would have to synchronize via torch.cuda.synchronize() before starting and before stopping the timer.
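A minimal sketch of such a synchronized timing loop (using a small torch.nn.Linear stand-in for the model, and falling back to CPU when CUDA is unavailable; both the model and the input sizes here are placeholders):

```python
import time

import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Small stand-in model; replace with your own network.
model = nn.Linear(512, 512).to(device).eval()
x = torch.randn(64, 512, device=device)

with torch.no_grad():
    model(x)  # warm-up pass

    if device == "cuda":
        torch.cuda.synchronize()  # wait for pending kernels before starting the clock
    t1 = time.time()
    N = 100
    for _ in range(N):
        out = model(x)
    if device == "cuda":
        torch.cuda.synchronize()  # wait for the last kernel before stopping the clock
    t2 = time.time()

print('Avg: {:.6f} s per forward pass'.format((t2 - t1) / N))
```

Without the two synchronize() calls, time.time() can stop the clock while kernels are still queued on the GPU, which makes the measurement meaningless.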

I am aware of the synchronization; I measure by averaging over 100 forward passes (there is no difference with or without synchronization). If I use with torch.jit.optimized_execution(False): I get roughly the same time as the original model.

import time

import torch
import torchvision

model = torchvision.models.resnet50(pretrained=False).cuda()
model.eval()
model = torch.jit.script(model)

# Disable all gradient computations
torch.set_grad_enabled(False)

# Load & transform the image

# warm-up
embedding = model(image)

torch.cuda.synchronize()  # wait for pending kernels before starting the timer
t1 = time.time()
N = 100
for i in range(N):
    embedding = model(image)
torch.cuda.synchronize()  # wait for the last kernel before stopping the timer
t2 = time.time()
t = t2 - t1
print('Time: {:.4f}, avg: {:.4f}'.format(t, t / N))
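To reproduce the comparison mentioned above, the same scripted module can be timed inside and outside the torch.jit.optimized_execution context. This is a sketch with a small nn.Sequential stand-in running on CPU for simplicity; the real model would be the SSD or ResNet50 on GPU (where the synchronization from the earlier snippet also applies):

```python
import time

import torch
import torch.nn as nn

# Stand-in for the real network, scripted the same way as above.
model = torch.jit.script(nn.Sequential(nn.Linear(256, 256), nn.ReLU()).eval())
x = torch.randn(32, 256)

def bench(m, inp, n=100):
    """Average forward-pass time over n iterations, after a warm-up pass."""
    with torch.no_grad():
        m(inp)  # warm-up; the JIT may profile/optimize during the first passes
        t1 = time.time()
        for _ in range(n):
            out = m(inp)
        return (time.time() - t1) / n, out

t_opt, out_opt = bench(model, x)
with torch.jit.optimized_execution(False):
    t_noopt, out_noopt = bench(model, x)

print('optimized: {:.6f} s, non-optimized: {:.6f} s'.format(t_opt, t_noopt))
```

If the non-optimized time matches the eager model while the optimized one is slower, that points at the JIT's optimization passes rather than scripting overhead per se.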

I wonder if TorchScript is sensitive to GPU architecture (faster on recent GPUs but slower on older ones?). I have not run the same comparison on newer GPUs yet (in any case, I am interested in the older GPUs, as they will be used in production).

I am seeing the same behavior. Did you ever find a reason?

Not yet. My guess is that it is related to two things: (1) the model's network architecture, and (2) the type of GPU. I will update here if/when I find a definitive answer.