(1) x = a * x + b
(2) x.mul_(a); x.add_(b)
is (2) faster than (1)?
Mostly not, and you don’t get JIT optimisations.
Best regards
Thomas
So (1) can get JIT optimizations?
On the GPU, (1) will get fused into a single kernel (instead of two) by the JIT, saving you 1 of 4 reads and 1 of 2 writes by not writing the intermediate a * x to memory. A computation like this on sizeable inputs is memory bound, so you’re saving significantly (if that is a significant part of your computation).
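As a minimal sketch of the difference, assuming the out-of-place form is wrapped in `torch.jit.script` so the JIT can see it (the fusion into one kernel happens on a CUDA device; on CPU the example still runs and both variants compute the same result):

```python
import torch

# Out-of-place form (1): the JIT can fuse a * x + b into a single kernel on GPU,
# avoiding a round trip to memory for the intermediate a * x.
@torch.jit.script
def fused(x: torch.Tensor, a: float, b: float) -> torch.Tensor:
    return a * x + b

# In-place form (2): two separate kernel launches, and no JIT fusion.
def inplace(x: torch.Tensor, a: float, b: float) -> torch.Tensor:
    x.mul_(a)
    x.add_(b)
    return x

x = torch.randn(1_000_000)
y1 = fused(x.clone(), 2.0, 1.0)
y2 = inplace(x.clone(), 2.0, 1.0)
print(torch.allclose(y1, y2))  # both variants compute the same values
```

Note that `x.clone()` is used so each variant sees the same input; the in-place version would otherwise overwrite `x` for the next call.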
Best regards
Thomas
Thank you very much. Is there any guide on how to write fast PyTorch code like what you said?
There are bits of advice in various places, and where the best place to look is might depend on what you want to achieve.
NVIDIA’s dev blog (to pick a random article), the PyTorch article on LSTMs; I have tried to blog a bit about optimizations and a common reduction code pattern…
In the end, you probably want to benchmark optimizations and gather some experience around what works and does not work.
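For benchmarking this kind of micro-optimization, one option is PyTorch’s own `torch.utils.benchmark` module, which handles warmup and timing for you. A minimal sketch comparing the two forms from above (sizes and repeat counts are arbitrary choices here):

```python
import torch
from torch.utils import benchmark

x = torch.randn(1_000_000)

# Time the out-of-place form: x = a * x + b
t1 = benchmark.Timer(
    stmt="a * x + b",
    globals={"x": x, "a": 2.0, "b": 1.0},
)

# Time the in-place form: x.mul_(a); x.add_(b)
t2 = benchmark.Timer(
    stmt="x.mul_(a); x.add_(b)",
    globals={"x": x.clone(), "a": 2.0, "b": 1.0},
)

print(t1.timeit(100))
print(t2.timeit(100))
```

On GPU, remember that CUDA kernels launch asynchronously; `benchmark.Timer` synchronizes for you, which is one reason to prefer it over hand-rolled `time.time()` loops.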
Also, you can always look here at the forums or try to ask with concrete code. We have some cuda questions here every once in a while and I certainly have learnt a lot about GPU programming from people around PyTorch.
Best regards
Thomas
Thank you very much!