In a nutshell,
- “compilation” analyzes whole functions, with knowledge about variable types - some optimizations are done at this level (e.g. dead code elimination)
- python bytecode interpreter is not used to execute generated code - more specialized executor for statically typed code supposedly works faster
- fusion optimizations further compile specialized cuda kernels, so e.g. a.mul(b).add(c) is computed in one go
- some patterns have specialized optimizations, e.g. conv+batchnorm