Are inplace operations faster?

tom · November 16, 2019, 9:58pm

On the GPU, (1) will get fused by the JIT into a single kernel instead of two. That saves you 1 of 4 reads and 1 of 2 writes, because the intermediate a*x is never written to memory. A computation like this on sizeable inputs is memory bound, so you're saving significantly (if it is a significant part of your computation).
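To make the read/write count concrete, here is a rough back-of-the-envelope sketch. Expression (1) is not repeated in this post, so I am assuming it has the form a*x + b with a, x and b all full-size float32 tensors; the element count n is made up, and caches are ignored.

```python
# Memory-traffic estimate for an elementwise a * x + b.
# Assumptions (not from the original post): a, x, b are full-size
# float32 tensors with n elements each, and caching is ignored.

BYTES_PER_ELEMENT = 4  # float32


def traffic(n, reads, writes):
    """Bytes moved for `reads` tensor reads and `writes` tensor writes."""
    return (reads + writes) * n * BYTES_PER_ELEMENT


n = 10_000_000  # hypothetical tensor size

# Unfused: kernel 1 reads a, x and writes tmp;
#          kernel 2 reads tmp, b and writes out.
unfused = traffic(n, reads=4, writes=2)

# Fused: one kernel reads a, x, b and writes out; tmp stays in registers.
fused = traffic(n, reads=3, writes=1)

print(f"unfused: {unfused / 1e6:.0f} MB")   # unfused: 240 MB
print(f"fused:   {fused / 1e6:.0f} MB")     # fused:   160 MB
print(f"saved:   {1 - fused / unfused:.0%}")  # saved:   33%
```

Under these assumptions the fused kernel moves about a third less data, and for a memory-bound kernel the runtime should shrink by roughly the same fraction.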

Best regards

Thomas