Is autograd optimized well enough that there is no need to spend time writing custom CUDA code for the backward pass?

Or is hand-written CUDA code faster most of the time?

I have seen quite a few repos that write the backward pass themselves in CUDA, and I have not yet had time to test which approach is faster.
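For concreteness, the kind of test I have in mind would look roughly like the sketch below. It assumes PyTorch, and `ManualSquare` and `time_backward` are hypothetical names made up for illustration. The hand-written backward here is plain PyTorch standing in for a custom CUDA kernel, so it only shows how the comparison could be set up, not what a real kernel would achieve.

```python
import time
import torch


class ManualSquare(torch.autograd.Function):
    """Hypothetical op y = x**2 with a hand-written backward.

    A real custom-CUDA version would call into a compiled kernel here;
    this stand-in uses ordinary tensor ops.
    """

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # d(x^2)/dx = 2x, chained with the incoming gradient
        return 2 * x * grad_out


def time_backward(fn, x, iters=20):
    """Average wall-clock seconds per forward + backward pass."""
    if x.is_cuda:
        torch.cuda.synchronize()  # finish pending GPU work before timing
    start = time.perf_counter()
    for _ in range(iters):
        x.grad = None
        fn(x).sum().backward()
    if x.is_cuda:
        torch.cuda.synchronize()  # wait for queued kernels to complete
    return (time.perf_counter() - start) / iters


device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1024, 1024, device=device, requires_grad=True)

t_auto = time_backward(lambda t: t * t, x)        # backward derived by autograd
t_manual = time_backward(ManualSquare.apply, x)   # hand-written backward
print(f"autograd: {t_auto:.6f}s  manual: {t_manual:.6f}s")
```

The timings will depend on the op, tensor sizes, and hardware, so a toy op like this says little about larger or fused kernels.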

Hopefully autograd is optimized well enough that I do not have to worry about writing CUDA.