What provides faster training times: torch.compile() or custom C++ and CUDA extensions?

I want to decrease the training time for my model. I've followed the Performance Tuning Guide, but I want to improve performance much further. I'm thinking my next options are either to use torch.compile() in PyTorch 2.0 or to write custom C++ code and use pybind11 to call my C++ PyTorch functions from Python. Does anyone have insight into which is more performant?

Any insight/advice is appreciated.

torch.compile() is the easier thing to try and will likely give you some speedups. I personally wouldn't bother with custom C++ code unless you already have a bunch of experience. We don't explicitly compare torch.compile() to custom C++ code; we compare it to eager PyTorch code.

Yes, torch.compile() is much easier to try out, so I'll give it a go and see the results. I have a lot of C++ experience, so I could write the custom C++ code, but it would take much longer to implement. Since my custom loss functions are fairly complicated, with lots of operations, a for loop, and other function calls, I suspect the custom C++ route would be faster.