How best to speed up the for-loop in a Kalman filter


I am looking to speed up some custom Kalman filter models. It seems clear to me that the for-loop over time steps is the main performance bottleneck. I have tried JIT-compiling the loop, which gives some improvement, but the result is still far from what I expect to be possible.
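
For anyone reading along, a minimal sketch of the kind of loop in question might look like the following. It assumes a scalar linear-Gaussian model in plain Python; the function name `kalman_filter_1d` and all of its parameters are illustrative assumptions, not the actual custom models. What it shows is that each iteration needs the state estimate and variance from the previous iteration, which is why the time loop cannot be vectorized in the straightforward way.

```python
def kalman_filter_1d(observations, a=1.0, h=1.0, q=1e-3, r=1e-1, x0=0.0, p0=1.0):
    """Scalar Kalman filter (hypothetical example, not the original models).

    a: state transition, h: observation coefficient,
    q: process noise variance, r: observation noise variance,
    x0, p0: initial state mean and variance.
    """
    x, p = x0, p0
    means, variances = [], []
    for z in observations:
        # Predict step: propagate the previous posterior forward in time.
        x_pred = a * x
        p_pred = a * p * a + q

        # Update step: correct the prediction with the new observation z.
        k = p_pred * h / (h * p_pred * h + r)  # Kalman gain
        x = x_pred + k * (z - h * x_pred)
        p = (1.0 - k * h) * p_pred

        # x and p feed into the next iteration (sequential dependency).
        means.append(x)
        variances.append(p)
    return means, variances


means, variances = kalman_filter_1d([1.0] * 50)
```

In a sketch like this the per-step arithmetic is tiny, so interpreter or dispatch overhead per iteration would likely dominate the runtime.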

Is there a best-practice approach that I might be missing?

My next step would be to write a custom C++ module for the loop logic, but I’m still hopeful that there is a better way to make things faster.

Thanks for any help!