Hi all,

I’ve recently noticed that pytorch GRUCell is mathematically different from the tensorflow one.

In pytorch, the GRU cell is implemented like this:

```
r = sigmoid(W_{ir} x + b_{ir} + W_{hr} h + b_{hr})
z = sigmoid(W_{iz} x + b_{iz} + W_{hz} h + b_{hz})
n = tanh(W_{in} x + b_{in} + r * (W_{hn} h + b_{hn}))
h' = (1 - z) * n + z * h
```

In tensorflow, the GRU cell is implemented like this:

```
r = sigmoid(W_{ir} x + b_{ir} + W_{hr} h + b_{hr})
z = sigmoid(W_{iz} x + b_{iz} + W_{hz} h + b_{hz})
n = tanh(W_{in} x + b_{in} + W_{hn} (r * h) + b_{hn})
h' = (1 - z) * n + z * h
```

A subtle difference appears in the computation of `n`: pytorch first applies a linear layer to the memory state `h` and then multiplies by the gate `r`, whereas TF does these two steps in the reverse order (and merges the biases together, but this does not seem important).
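To make the difference concrete, here is a minimal pure-Python sketch of the two candidate-state formulas from the equations above. It is not the actual code of either library: the helper names (`matvec`, `candidate_reset_after`, `candidate_reset_before`) and the toy values are my own, and `a` stands for the input part `W_{in} x + b_{in}`, which is identical in both variants.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def candidate_reset_after(a, W_hn, b_hn, r, h):
    """pytorch-style equation: n = tanh(a + r * (W_hn h + b_hn))."""
    u = matvec(W_hn, h)
    return [math.tanh(ai + ri * (ui + bi))
            for ai, ri, ui, bi in zip(a, r, u, b_hn)]

def candidate_reset_before(a, W_hn, b_hn, r, h):
    """Paper/TF-style equation: n = tanh(a + W_hn (r * h) + b_hn)."""
    rh = [ri * hi for ri, hi in zip(r, h)]
    u = matvec(W_hn, rh)
    return [math.tanh(ai + ui + bi) for ai, ui, bi in zip(a, u, b_hn)]

# Toy values (hypothetical): hidden size 2, zero input contribution and bias.
a = [0.0, 0.0]
b_hn = [0.0, 0.0]
r = [1.0, 0.0]   # reset gate keeps unit 0 and fully resets unit 1
h = [1.0, 2.0]

# An off-diagonal W_hn mixes the hidden units, so the order of operations matters.
W_mix = [[0.0, 1.0],
         [1.0, 0.0]]
n_reset_after = candidate_reset_after(a, W_mix, b_hn, r, h)    # [tanh(2), 0]
n_reset_before = candidate_reset_before(a, W_mix, b_hn, r, h)  # [0, tanh(1)]
print(n_reset_after, n_reset_before)

# With a diagonal W_hn and zero b_hn the two formulas coincide.
W_diag = [[2.0, 0.0],
          [0.0, 3.0]]
print(candidate_reset_after(a, W_diag, b_hn, r, h) ==
      candidate_reset_before(a, W_diag, b_hn, r, h))
```

In the first variant `r` gates the outputs of the recurrent linear layer, while in the second it gates its inputs, so a reset unit can still feed other units in the pytorch form but not in the paper form. The two agree whenever `r` is all ones, or when `W_{hn}` is diagonal and `b_{hn}` is zero.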

The original paper proposing GRU (http://arxiv.org/abs/1406.1078) seems to match the tensorflow version. The CUDNN implementation seems to match the pytorch code.

In my case, it seems that the pytorch version converges more slowly and to worse results. Has anybody measured differences between these two variants?

Does anybody know why this discrepancy appeared in the first place?