Custom RNN/GRU implementation in C++

I would like to write my own custom GRU implementation (for example, changing the tanh activation to ReLU) that trains as efficiently as the torch.nn.GRU class.

I believe I need to implement it as a C++ extension to avoid a per-time-step for-loop in Python.
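To make the change concrete, here is a minimal sketch of the cell I have in mind and of the time-stepping loop I would like to avoid. It is plain Python with no torch dependency so that it runs on its own. The gate equations follow the ones given in the torch.nn.GRU documentation, with ReLU swapped in for tanh on the candidate state. The names gru_relu_step and gru_relu_forward and the parameter dict p are my own illustration, not anything from PyTorch.

```python
import math

def sigmoid(v):
    # Element-wise logistic function on a list of floats.
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def relu(v):
    # Element-wise ReLU; this replaces tanh in the candidate state.
    return [max(0.0, a) for a in v]

def affine(W, v, b):
    # Computes W @ v + b, where W is a list of rows.
    return [sum(w * a for w, a in zip(row, v)) + bi for row, bi in zip(W, b)]

def gru_relu_step(x, h, p):
    # One time step. p is a hypothetical dict holding per-gate weights
    # and biases: W_ir, W_hr, b_ir, b_hr (reset), W_iz, ... (update),
    # W_in, W_hn, b_in, b_hn (candidate).
    r = sigmoid([a + b for a, b in zip(affine(p["W_ir"], x, p["b_ir"]),
                                       affine(p["W_hr"], h, p["b_hr"]))])
    z = sigmoid([a + b for a, b in zip(affine(p["W_iz"], x, p["b_iz"]),
                                       affine(p["W_hz"], h, p["b_hz"]))])
    hn = affine(p["W_hn"], h, p["b_hn"])
    # Candidate: n = relu(W_in x + b_in + r * (W_hn h + b_hn))
    n = relu([a + ri * b
              for a, ri, b in zip(affine(p["W_in"], x, p["b_in"]), r, hn)])
    # New hidden state: h' = (1 - z) * n + z * h
    return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

def gru_relu_forward(xs, h0, p):
    # The per-time-step Python loop that I would like to move into C++.
    h = h0
    outputs = []
    for x in xs:
        h = gru_relu_step(x, h, p)
        outputs.append(h)
    return outputs, h
```

Written with torch tensors, this loop works but is much slower for me than torch.nn.GRU, which is why I am looking at a C++ extension.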

Can anyone point me to where I should start? Ideally, I would base it on PyTorch's existing C++ GRU implementation, but I am struggling to find that in the source code.

Thanks in advance!