GRUCell is different in PyTorch and TensorFlow

Hi all,
I’ve recently noticed that the PyTorch GRUCell is mathematically different from the TensorFlow one.
In PyTorch, the GRU cell is implemented like this:

r = sigmoid(W_{ir} x + b_{ir} + W_{hr} h + b_{hr}) 
z = sigmoid(W_{iz} x + b_{iz} + W_{hz} h + b_{hz})
n = tanh(W_{in} x + b_{in} + r * (W_{hn} h + b_{hn}))
h' = (1 - z) * n + z * h

In TensorFlow, the GRU cell is implemented like this:

r = sigmoid(W_{ir} x + b_{ir} + W_{hr} h + b_{hr}) 
z = sigmoid(W_{iz} x + b_{iz} + W_{hz} h + b_{hz})
n = tanh(W_{in} x + b_{in} + W_{hn} (r * h) + b_{hn})
h' = (1 - z) * n + z * h

A subtle difference appears in the computation of n: PyTorch first applies the linear layer to the hidden state h and then multiplies by the gate r, whereas TF does these two steps in the reverse order (and merges the biases together, but that seems unimportant).
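To make the difference concrete, here is a minimal sketch that computes one step of both variants by hand from the same weights, using PyTorch and an arbitrary hidden size of 4. The "TF-style" line is my own re-implementation of the reordered computation for comparison, not actual TensorFlow code:

import torch

torch.manual_seed(0)
hidden = 4                      # arbitrary size, just for illustration
x = torch.randn(1, hidden)
h = torch.randn(1, hidden)

cell = torch.nn.GRUCell(hidden, hidden)

# PyTorch stacks the r, z, n weights along dim 0, so split them back out.
W_ir, W_iz, W_in = cell.weight_ih.chunk(3, 0)
W_hr, W_hz, W_hn = cell.weight_hh.chunk(3, 0)
b_ir, b_iz, b_in = cell.bias_ih.chunk(3, 0)
b_hr, b_hz, b_hn = cell.bias_hh.chunk(3, 0)

r = torch.sigmoid(x @ W_ir.t() + b_ir + h @ W_hr.t() + b_hr)
z = torch.sigmoid(x @ W_iz.t() + b_iz + h @ W_hz.t() + b_hz)

# PyTorch variant: apply the linear layer to h first, then gate with r.
n_pt = torch.tanh(x @ W_in.t() + b_in + r * (h @ W_hn.t() + b_hn))
h_pt = (1 - z) * n_pt + z * h

# TF-style variant: gate h with r first, then apply the linear layer.
n_tf = torch.tanh(x @ W_in.t() + b_in + (r * h) @ W_hn.t() + b_hn)
h_tf = (1 - z) * n_tf + z * h

print(torch.allclose(h_pt, cell(x, h)))   # True: matches nn.GRUCell
print(torch.allclose(h_pt, h_tf))         # almost surely False: the two variants differ

Note that with the same set of weights the two variants give different outputs, so checkpoints are not interchangeable between the two formulations even though both are valid GRUs.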

The original paper proposing the GRU (http://arxiv.org/abs/1406.1078) seems to match the TensorFlow version, while the cuDNN implementation seems to match the PyTorch code.

In my case, the PyTorch version seems to converge more slowly and to worse results. Has anybody measured the difference between these two variants?

Does anybody know why this discrepancy appeared in the first place?


See footnote 1 in this paper https://arxiv.org/pdf/1412.3555.pdf

Thanks for your comment!

For completeness: the difference turned out to come from other discrepancies between the TF and PyTorch implementations. The difference caused directly by the different GRU cell equations was in fact very small, and its direction was not consistent.