GRU implementation

Hello, I’m working on a custom GRU implementation.

However, when I looked into the PyTorch GRU implementation, I noticed that hy is computed as (1 - input_gate) * new_gate + input_gate * hidden, even though according to the original paper hy is computed as input_gate * new_gate + (1 - input_gate) * hidden.

In my opinion, the PyTorch version should work just as well, because input_gate always lies between 0 and 1, so the two forms only swap which term the gate weights. I’m just curious whether there is a specific reason for this choice, such as reducing computational cost.
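
To make the comparison concrete, here is a minimal scalar sketch of the two update rules in plain Python. It is not the PyTorch source: the function names and the numeric values are made up for illustration, and I am assuming the gate comes from a sigmoid. Under that assumption, negating the gate's pre-activation turns input_gate into 1 - input_gate, so one convention can reproduce the other.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def update_gate_on_hidden(gate, new_gate, hidden):
    # Form described above for PyTorch: the gate weights the previous hidden state.
    return (1.0 - gate) * new_gate + gate * hidden


def update_gate_on_new(gate, new_gate, hidden):
    # Form attributed above to the paper: the gate weights the candidate state.
    return gate * new_gate + (1.0 - gate) * hidden


# Hypothetical scalar values, chosen only for illustration.
pre_activation = 0.7
new_gate = 0.25
hidden = -0.5

gate = sigmoid(pre_activation)
flipped_gate = sigmoid(-pre_activation)  # equals 1 - gate for a sigmoid

hy_a = update_gate_on_hidden(gate, new_gate, hidden)
hy_b = update_gate_on_new(flipped_gate, new_gate, hidden)

# Algebraic rewrite of the first form that uses a single multiplication.
# The second form has the mirrored rewrite hidden + gate * (new_gate - hidden),
# so neither convention is cheaper in terms of arithmetic.
hy_fused = new_gate + gate * (hidden - new_gate)

print(abs(hy_a - hy_b) < 1e-12)
print(abs(hy_a - hy_fused) < 1e-12)
```

If this reasoning holds, the difference is only a naming convention for the gate, and a network trained with one form could learn the same function with the other by flipping the sign of the gate's weights and biases.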