Are in-place operations faster?

(1) x = a * x + b
(2) x.mul_(a); x.add_(b)

Is (2) faster than (1)?

Mostly not, and you don’t get JIT optimisations.

Best regards

Thomas

So (1) can get JIT optimizations?

On the GPU, (1) will get fused into a single kernel (instead of two) by the JIT, saving you 1 of 4 reads and 1 of 2 writes because the intermediate a*x is never written to memory. A computation like this on sizeable inputs is memory bound, so you’re saving a significant amount (if this operation is a significant part of your overall computation).
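
For illustration, here is a small sketch (my addition, not from the original post) of what that fused path can look like. The function name axpb, the tensor sizes, and the assumption that a, b, x are same-shaped CUDA tensors are all made up, and the exact fusion node names in the printed graph depend on the PyTorch version and which fuser is active:

```python
import torch

# Sketch: scripting the out-of-place expression lets the JIT fuse the
# pointwise multiply and add into one kernel, so the intermediate a * x
# is never written out to global memory.
@torch.jit.script
def axpb(x, a, b):
    return a * x + b

if torch.cuda.is_available():
    x = torch.randn(1_000_000, device="cuda")
    a = torch.randn(1_000_000, device="cuda")
    b = torch.randn(1_000_000, device="cuda")

    # A few warm-up calls so the JIT can profile the shapes and compile
    # the fused kernel.
    for _ in range(3):
        axpb(x, a, b)

    # Inspect the optimized graph and look for a fusion group
    # (e.g. prim::TensorExprGroup or prim::FusionGroup, depending on
    # the PyTorch version).
    print(axpb.graph_for(x, a, b))
```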

Best regards

Thomas

Thank you very much. Is there any guide on how to write fast PyTorch code like you described?

There are bits of advice in various places, and where best to look may depend on what you want to achieve.
For example: NVIDIA’s dev blog (to pick a random article), the PyTorch article on LSTMs, and I tried to blog a bit about optimizations and a common reduction code pattern.

In the end, you probably want to benchmark your optimizations and gather some experience with what works and what does not.
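
For example, a minimal benchmarking sketch (mine, not from the thread) using torch.utils.benchmark, which takes care of CUDA synchronization and warm-up when timing the two variants; the sizes are arbitrary:

```python
import torch
import torch.utils.benchmark as benchmark

# Sketch: compare the out-of-place and in-place variants on a made-up size.
x = torch.randn(10_000_000, device="cuda")
a = torch.randn(10_000_000, device="cuda")
b = torch.randn(10_000_000, device="cuda")

out_of_place = benchmark.Timer(
    stmt="a * x + b",
    globals={"x": x, "a": a, "b": b},
)
# Note: the in-place variant mutates x between runs, which does not
# materially affect the timing but does change its values.
in_place = benchmark.Timer(
    stmt="x.mul_(a); x.add_(b)",
    globals={"x": x, "a": a, "b": b},
)

# Printing a Measurement shows the timing statistics for each variant.
print(out_of_place.timeit(100))
print(in_place.timeit(100))
```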

Also, you can always look around the forums or ask with concrete code. We get CUDA questions here every once in a while, and I have certainly learnt a lot about GPU programming from people around PyTorch.

Best regards

Thomas

Thank you very much!