Slower Backward performance on CUSTOM C++ EXTENSIONS

I am following this tutorial on CUSTOM C++ AND CUDA EXTENSIONS.

I pretty much took the code from that page as-is and ran it in a Google Colab environment.


From the tutorial page we have the following comparisons


However, when I run it in the notebook, the backward pass of the custom extension is slower than the default one, on both CPU and GPU. Is there any explanation for why this would happen?

====== CPU Default
Forward: 216.003 us | Backward 237.442 us
====== CPU Cpp Custom Extension
Forward: 171.469 us | Backward 437.926 us
====== GPU Default
Forward: 309.739 us | Backward 523.746 us
====== GPU Cpp Custom Extension
Forward: 250.254 us | Backward 879.118 us


This is surprising indeed.
It is possible, though, that the backward of the built-in rnn has been heavily optimized since the tutorial was written, so the manual implementation no longer compares as favorably :confused:
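
To rule out a measurement artifact in numbers like the ones above, it may help to time forward and backward separately, with a few warm-up iterations and, on GPU, a synchronization before each clock read, since CUDA kernels are launched asynchronously. Below is a minimal sketch in plain Python, not the tutorial's benchmark script. The names `time_forward_backward`, `forward_fn`, `backward_fn` and `sync` are hypothetical and chosen only for illustration.

```python
import time


def time_forward_backward(forward_fn, backward_fn, sync=None, warmup=10, iters=100):
    """Return (forward_us, backward_us): mean time per call in microseconds.

    forward_fn()     -> runs the forward pass and returns its output
    backward_fn(out) -> runs the backward pass for that output
    sync             -> optional callable that blocks until pending device
                        work has finished (assumption: on GPU this would be
                        torch.cuda.synchronize; leave as None on CPU)
    """
    def wait():
        if sync is not None:
            sync()

    # Warm-up runs, so one-time costs (allocation, lazy init) are not timed.
    for _ in range(warmup):
        backward_fn(forward_fn())
    wait()

    forward_total = 0.0
    backward_total = 0.0
    for _ in range(iters):
        wait()  # make sure nothing from the previous iteration is pending
        start = time.perf_counter()
        out = forward_fn()
        wait()  # forward work must be finished before reading the clock
        forward_total += time.perf_counter() - start

        start = time.perf_counter()
        backward_fn(out)
        wait()  # same for backward
        backward_total += time.perf_counter() - start

    # Convert mean seconds per call to microseconds.
    return forward_total / iters * 1e6, backward_total / iters * 1e6
```

With PyTorch, `forward_fn` would call the module on its inputs and `backward_fn` would call `.backward()` on a scalar built from the output. On GPU you would pass `sync=torch.cuda.synchronize`. If the gap between the default and the custom backward is still there with this kind of timing, it is more likely a real difference than a timing issue.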
