Gated Recurrent Unit Biases General Question

Why does PyTorch’s implementation of Gated Recurrent Units use two sets of biases for the update, reset, and new gates? It seems redundant, since these biases are summed before activation. Were there any performance gains from this approach, as opposed to just one set of biases?

https://pytorch.org/docs/stable/generated/torch.nn.GRU.html
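For reference, the two bias sets are visible directly on the module’s parameters. A minimal sketch (layer sizes are arbitrary, chosen just for illustration):

```python
import torch
import torch.nn as nn

# A GRU layer with input size 4 and hidden size 8 (arbitrary sizes for illustration).
gru = nn.GRU(input_size=4, hidden_size=8, bias=True)

# Each layer exposes two bias vectors: one added to the input projection (W_ih x)
# and one added to the hidden projection (W_hh h), each of shape (3 * hidden_size,)
# covering the reset, update, and new gates.
print(gru.bias_ih_l0.shape)  # torch.Size([24])
print(gru.bias_hh_l0.shape)  # torch.Size([24])
```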

I think PyTorch just mirrors cuDNN here. What usually happens is that the non-recurrent computation Wx + b is merged (W = [Wr Wz Wn]) and done first, so theoretical uses of two biases are imaginable, but it beats me too why cuDNN has this…

I can’t imagine how two separate biases for z and r would increase efficiency or performance in any way; the gradients will be identical. Perhaps the new gate might have some nominal additional information from two separate biases, since one is added prior to the Hadamard product. But I wonder whether it was tested first for any performance gains, or if, as you mentioned, the underlying architecture in cuDNN just made it a better choice to leave in. That’s good to know. But then I wonder how TensorFlow got around that with their v3 implementation.
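To make the new-gate point concrete, here is one GRU step written out by hand, following the equations in the torch.nn.GRU docs (a sketch, not the actual cuDNN kernel). For r and z the two biases simply add, so only their sum matters; for n, the hidden-side bias sits inside the r-gated term, which is the one place the split is visible:

```python
import torch

def gru_step(x, h, w_ih, w_hh, b_ih, b_hh):
    """One GRU step following the equations in the torch.nn.GRU docs.

    w_ih: (3H, I), w_hh: (3H, H), b_ih and b_hh: (3H,) each,
    chunked in (reset, update, new) order as in PyTorch.
    """
    gi = x @ w_ih.t() + b_ih          # input projection plus its bias
    gh = h @ w_hh.t() + b_hh          # hidden projection plus its bias
    i_r, i_z, i_n = gi.chunk(3, dim=-1)
    h_r, h_z, h_n = gh.chunk(3, dim=-1)

    # For r and z the two biases are simply summed, so only b_ih + b_hh matters.
    r = torch.sigmoid(i_r + h_r)
    z = torch.sigmoid(i_z + h_z)

    # For n, the hidden-side bias b_hn sits inside the r-gated term,
    # so it gets scaled by r while b_in does not.
    n = torch.tanh(i_n + r * h_n)

    return (1 - z) * n + z * h
```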

The performance impact here is negligible: it’s the same as replacing mm with addmm, because the BLAS GEMM routine fuses the bias summation.
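A quick illustration of that equivalence (shapes are arbitrary, just for demonstration):

```python
import torch

N, I, H = 32, 16, 8                 # batch, input size, hidden size (arbitrary)
x = torch.randn(N, I)
w_ih = torch.randn(3 * H, I)
b_ih = torch.randn(3 * H)

# Two ways to apply the input projection plus bias:
out_mm = torch.mm(x, w_ih.t()) + b_ih        # separate matmul and add
out_addmm = torch.addmm(b_ih, x, w_ih.t())   # bias folded into the GEMM call

print(torch.allclose(out_mm, out_addmm))     # True
```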
