(1) x = a * x + b
(2) x.mul_(a); x.add_(b)
is (2) faster than (1)?
Mostly not, and you don’t get JIT optimisations.
Best regards
Thomas
So (1) can get JIT optimizations?
On the GPU, (1) will get fused into a single kernel (instead of two) by the JIT, saving you 1 of 4 reads and 1 of 2 writes by not writing the intermediate a * x to memory. A computation like this on sizeable inputs is memory bound, so you’re saving significantly (if that is a significant part of your computation).
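As a minimal sketch of the difference, assuming the out-of-place form is wrapped in `torch.jit.script` so the JIT can see it (the fusion into one kernel happens on a CUDA device; on CPU the example still runs and both variants compute the same result):

```python
import torch

# Out-of-place form (1): the JIT can fuse a * x + b into a single kernel on GPU,
# avoiding a round trip to memory for the intermediate a * x.
@torch.jit.script
def fused(x: torch.Tensor, a: float, b: float) -> torch.Tensor:
    return a * x + b

# In-place form (2): two separate kernel launches, and no JIT fusion.
def inplace(x: torch.Tensor, a: float, b: float) -> torch.Tensor:
    x.mul_(a)
    x.add_(b)
    return x

x = torch.randn(1_000_000)
y1 = fused(x.clone(), 2.0, 1.0)
y2 = inplace(x.clone(), 2.0, 1.0)
print(torch.allclose(y1, y2))  # both variants compute the same values
```

Note that `x.clone()` is used so each variant sees the same input; the in-place version would otherwise overwrite `x` for the next call.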
Best regards
Thomas
Thank you very much. Is there any guide on how to write fast PyTorch code like what you said?
There are bits of advice in various places, and where the best place to look is might depend on what you want to achieve.
NVIDIA’s dev blog (to pick a random article), the PyTorch article on LSTMs; I have tried to blog a bit about optimizations and a common reduction code pattern…
In the end, you probably want to benchmark optimizations and gather some experience around what works and does not work.
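For benchmarking this kind of micro-optimization, one option is PyTorch’s own `torch.utils.benchmark` module, which handles warmup and timing for you. A minimal sketch comparing the two forms from above (sizes and repeat counts are arbitrary choices here):

```python
import torch
from torch.utils import benchmark

x = torch.randn(1_000_000)

# Time the out-of-place form: x = a * x + b
t1 = benchmark.Timer(
    stmt="a * x + b",
    globals={"x": x, "a": 2.0, "b": 1.0},
)

# Time the in-place form: x.mul_(a); x.add_(b)
t2 = benchmark.Timer(
    stmt="x.mul_(a); x.add_(b)",
    globals={"x": x.clone(), "a": 2.0, "b": 1.0},
)

print(t1.timeit(100))
print(t2.timeit(100))
```

On GPU, remember that CUDA kernels launch asynchronously; `benchmark.Timer` synchronizes for you, which is one reason to prefer it over hand-rolled `time.time()` loops.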
Also, you can always look here at the forums or try to ask with concrete code. We have some cuda questions here every once in a while and I certainly have learnt a lot about GPU programming from people around PyTorch.
Best regards
Thomas
Thank you very much!