How does one go about overriding the derivative of a primitive matrix operation?

If I wanted to override the implementation of the backward pass of a primitive matrix operation, say, matmul or convolution, how might I go about it?

Some subquestions:
(a) Is there any way to do it (efficiently) in Python (just me hoping, haha)?
(b) Assuming the answer to (a) is no, can I implement each operation once in C++ and have it work for all backends, or would I need a separate implementation per backend, e.g. CUDA?


You could write a custom C++ extension as seen here. If you implement the backward using pure PyTorch operations, you won't need to write backend-specific code, since those operations already dispatch to the right backend. You would still have the option to write custom CUDA kernels if you need more performance.
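Regarding (a): you can override the backward of an op in pure Python with `torch.autograd.Function` (the same mechanism a C++ extension would hook into). A minimal sketch for matmul, with the standard gradients `grad_a = grad_out @ b.T` and `grad_b = a.T @ grad_out`, where `MyMatmul` is just an illustrative name:

```python
import torch

class MyMatmul(torch.autograd.Function):
    """Matmul with a custom (here: hand-written) backward pass."""

    @staticmethod
    def forward(ctx, a, b):
        # Stash inputs needed to compute gradients later.
        ctx.save_for_backward(a, b)
        return a @ b

    @staticmethod
    def backward(ctx, grad_out):
        a, b = ctx.saved_tensors
        # Custom derivative rules go here; these are the standard ones.
        grad_a = grad_out @ b.t()
        grad_b = a.t() @ grad_out
        return grad_a, grad_b

a = torch.randn(3, 4, requires_grad=True)
b = torch.randn(4, 5, requires_grad=True)
out = MyMatmul.apply(a, b)
out.sum().backward()  # populates a.grad and b.grad via the custom backward
```

Since the backward here is itself built from PyTorch ops, it runs on whatever device the inputs live on; the usual caveat is that a Python-level `Function` adds some interpreter overhead per call compared to a fused C++/CUDA kernel.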