SRU and LSTM implemented from the formulas get no speedup with CUDA?

SRU is similar to LSTM. When I call PyTorch's built-in LSTM (nn.LSTM) with CUDA, it is obviously faster than on the CPU. However, when I implement LSTM myself from the formulas and run it on the GPU, the speed is almost the same as on the CPU. This is my SRU demo: https://github.com/bamtercelboo/pytorch_SRU. Could someone look at my demo and check whether I have made a mistake in how I use cuda()?
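
To make the question concrete, here is a minimal sketch of what "implemented from the formulas" might look like, together with a way to time it. It is not the code from the linked repository: the class name `FormulaSRU`, the helper `time_forward`, and all sizes are hypothetical, and the cell is a simplified SRU-style recurrence (the input size equals the hidden size, and one `nn.Linear` with bias produces all three projections). Since CUDA kernels are launched asynchronously, the sketch calls `torch.cuda.synchronize()` before reading the clock; it falls back to the CPU when no GPU is available.

```python
import time

import torch
import torch.nn as nn


class FormulaSRU(nn.Module):
    """Simplified SRU-style layer written directly from the recurrence (hypothetical)."""

    def __init__(self, size):
        super().__init__()
        self.size = size
        # One projection yields the candidate, forget gate, and reset gate inputs.
        self.proj = nn.Linear(size, 3 * size)

    def forward(self, x):
        # x has shape (seq_len, batch, size)
        seq_len, batch, _ = x.shape
        u = self.proj(x)  # a single matmul covering every time step
        x_tilde, f, r = u.chunk(3, dim=-1)
        f = torch.sigmoid(f)
        r = torch.sigmoid(r)
        c = x.new_zeros(batch, self.size)  # created on the same device as x
        outputs = []
        # The Python loop launches several small element-wise ops per step.
        for t in range(seq_len):
            c = f[t] * c + (1 - f[t]) * x_tilde[t]
            h = r[t] * torch.tanh(c) + (1 - r[t]) * x[t]
            outputs.append(h)
        return torch.stack(outputs), c


def time_forward(model, x, repeats=3):
    """Average wall-clock time of a forward pass, synchronizing if x is on the GPU."""
    if x.is_cuda:
        torch.cuda.synchronize()
    start = time.perf_counter()
    with torch.no_grad():
        for _ in range(repeats):
            out, _ = model(x)
    if x.is_cuda:
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / repeats, out


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = FormulaSRU(64).to(device)          # move the parameters to the device
x = torch.randn(20, 8, 64, device=device)  # create the input on the same device
elapsed, out = time_forward(model, x)
print(f"{device}: {elapsed:.4f} s per forward pass")
```

With small sizes like these, a loop of this kind spends most of its time launching tiny operations one step at a time, so similar CPU and GPU timings would not by themselves show that cuda() is being used incorrectly.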