Why does an RNN need two biases?

In the RNN implementation, there are two biases, b_ih and b_hh.
Why is this? Is it any different from just using one bias?
Does it affect performance or efficiency?


You mean in general? For the same reason that it needs two sets of weights: one for the input and one for the previous state.

Taking an RNN with tanh activation as an example, it follows
h_t = tanh(w_{ih} * x_t + b_{ih} + w_{hh} * h_{(t-1)} + b_{hh})
b_{ih} and b_{hh} are just biases (trainable constants); neither is tied specifically to the input or to the previous state.
In fact, if we let b = b_{ih} + b_{hh}, then
h_t = tanh(w_{ih} * x_t + w_{hh} * h_{(t-1)} + b).
The two formulations are mathematically equivalent.
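A quick numerical check of the equivalence above (a minimal NumPy sketch of a single RNN step, not the actual PyTorch/cuDNN kernels; the shapes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3

# Random weights and the two separate biases
w_ih = rng.standard_normal((hidden_size, input_size))
w_hh = rng.standard_normal((hidden_size, hidden_size))
b_ih = rng.standard_normal(hidden_size)
b_hh = rng.standard_normal(hidden_size)

x_t = rng.standard_normal(input_size)      # input at time t
h_prev = rng.standard_normal(hidden_size)  # hidden state at time t-1

# Two separate biases, as in the cuDNN-style formulation
h_two = np.tanh(w_ih @ x_t + b_ih + w_hh @ h_prev + b_hh)

# Single combined bias b = b_ih + b_hh
h_one = np.tanh(w_ih @ x_t + w_hh @ h_prev + (b_ih + b_hh))

assert np.allclose(h_two, h_one)  # identical hidden states
```

The only practical difference is a redundant parameterization: the two biases are summed before the nonlinearity, so gradients flow to both equally.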


As you pointed out, it doesn't really change the definition of the model, but this is what cuDNN does, so we've made our RNNs consistent with that behaviour.


Okay, good to know. Thanks.