Why is torch.jit.script slower?

111480 · May 3, 2021, 4:32pm

I compared the inference speed of torchvision’s SqueezeNet as a regular model against the same model wrapped in torch.jit.script(model). I expected the scripted version to be faster because I thought TorchScript was asynchronous, but the original model was faster. What’s the reason? What did I miss?

scripted model : 0.2353s
original model : 0.1644s

googlebot · May 4, 2021, 12:40pm

it is not asynchronous (beyond cuda kernel launches, which are not related to jit), it is just a python-less execution mode with optimizations

one thing I’ve seen, is that some jitted operations incorrectly enable requires_grad

but the simpler explanation is that you’re not measuring it right: time the THIRD call of the compiled model (actually, from your screenshot it seems you’re compiling twice, i.e. creating two model objects, which is also incorrect). the reason is that the profiling mode executor creates optimized bytecode on the second call.

111480 · May 4, 2021, 2:42pm

Thanks for replying.
I measured the time separately for the torch.jit.script model and the original model, like this:

torch.jit.script(models.squeezenet1_0(pretrained=True).cuda().eval())
torch.jit.script(models.squeezenet1_0(pretrained=True).cuda().eval())
#models.squeezenet1_0(pretrained=True).cuda().eval()
#models.squeezenet1_0(pretrained=True).cuda().eval()

Then result is

#torch.jit.script(models.squeezenet1_0(pretrained=True).cuda().eval())
#torch.jit.script(models.squeezenet1_0(pretrained=True).cuda().eval())
models.squeezenet1_0(pretrained=True).cuda().eval()
models.squeezenet1_0(pretrained=True).cuda().eval()

Why exactly is TorchScript slower than PyTorch?

googlebot · May 4, 2021, 2:59pm

as I said, do something like:
net = torch.jit.script(models.squeezenet1_0(pretrained=True).cuda().eval())
net(x); net(x)

and only then measure time (e.g. %timeit net(x))
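A minimal sketch of the "warm up first, then time" pattern described above, written as a plain-Python helper so it runs without a GPU. The names `time_after_warmup` and `fake_model` and the `sync` parameter are hypothetical, introduced only for illustration; they are not part of PyTorch.

```python
import time
from statistics import median


def time_after_warmup(fn, warmup=2, repeats=20, sync=None):
    """Return the median wall-clock seconds per call of fn.

    The first `warmup` calls are executed but not timed, so one-off
    costs (such as compilation on early calls) do not skew the result.
    `sync`, if given, is called before each clock read so that work
    launched asynchronously has finished before time is recorded.
    """
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()

    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        samples.append(time.perf_counter() - start)
    return median(samples)


# Stand-in workload (hypothetical) that records how often it is called.
calls = []


def fake_model():
    calls.append(1)
    return sum(range(1000))


elapsed = time_after_warmup(fake_model, warmup=2, repeats=5)
print(f"median per call: {elapsed:.6f}s")
```

For the scripted model from the post above, this would presumably be called as time_after_warmup(lambda: net(x), sync=torch.cuda.synchronize), so that the two warm-up calls are excluded and pending CUDA kernels are finished before each clock read.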

111480 · May 4, 2021, 3:24pm

As you said, TorchScript is faster.
Is that because it is optimized differently?
I would like a more detailed explanation.

googlebot · May 4, 2021, 4:20pm

In a nutshell,

- “compilation” analyzes whole functions, with knowledge about variable types - some optimizations are done at this level (e.g. dead code elimination)
- python bytecode interpreter is not used to execute generated code - more specialized executor for statically typed code supposedly works faster
- fusion optimizations further compile specialized cuda kernels, so e.g. a.mul(b).add(c) is computed in one go
- some patterns have specialized optimizations, e.g. conv+batchnorm
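As a rough illustration of the points above, here is a small sketch that scripts a toy conv+batchnorm module, freezes it, and checks that all three variants agree. `SmallNet` is a hypothetical stand-in, not the SqueezeNet from the thread. Using torch.jit.freeze is an assumption about where the conv+batchnorm folding happens, based on the frozen-module optimizations documented for torch.jit.run_frozen_optimizations. The printed graph is left for inspection; no particular graph content is claimed here.

```python
import torch
import torch.nn as nn


class SmallNet(nn.Module):
    """Toy conv + batchnorm + relu block (hypothetical example model)."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(8)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.bn(self.conv(x)))


torch.manual_seed(0)
eager = SmallNet().eval()            # eval mode is required before freezing
scripted = torch.jit.script(eager)   # statically typed, python-less version
frozen = torch.jit.freeze(scripted)  # inlines weights, runs extra optimizations

x = torch.randn(1, 3, 16, 16)
with torch.no_grad():
    ref = eager(x)
    out_scripted = scripted(x)
    out_frozen = frozen(x)

# Folding can change results by small floating-point amounts,
# so compare with a tolerance rather than exact equality.
same_scripted = torch.allclose(ref, out_scripted, atol=1e-5)
same_frozen = torch.allclose(ref, out_frozen, atol=1e-4)
print(same_scripted, same_frozen)

# Inspect the optimized graph, e.g. to see whether batch_norm is still present.
print(frozen.graph)
```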

vince62s · October 12, 2022, 6:50pm

Hi, I am hijacking this thread for a similar question.

I ran the code of this link: pytorch-jit.ipynb · GitHub

I get the exact same time with and without JIT, so I am not seeing the performance difference shown in the link.

any clue?
(running on pytorch 1.12.1 cuda 11.3)

ptrblck · October 13, 2022, 8:12am

Your notebook seems to show a ~2x speedup so I’m unsure how to interpret your post. Did you see any other results when properly profiling with syncs?

vince62s · October 13, 2022, 8:43am

this is not “my” notebook. this is someone else’s notebook that I tried on my machine, and I could not replicate the speedup.