Redundant biases in LSTM definition

While looking at the definition of the LSTM implemented in PyTorch, I noticed that the gates i, f, g and o have two biases each instead of a single one. A definition with a single bias per gate would clearly be equivalent, since the two biases can be merged into one by summing them. Is there any implementation advantage to having two instead of one?
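
To make the equivalence concrete, here is the input gate written in the notation I believe the PyTorch LSTM documentation uses; the merged symbol b_i is introduced here only for illustration, and the same argument would apply to f, g and o:

```latex
\begin{aligned}
i_t &= \sigma\left(W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi}\right) \\
    &= \sigma\left(W_{ii} x_t + W_{hi} h_{t-1} + b_i\right),
    \qquad b_i = b_{ii} + b_{hi}
\end{aligned}
```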

Hi Diego,

My understanding is that this is an artefact of cuDNN compatibility.
Note that for GRU there is even a subtle difference between having two biases and having one.
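
A minimal scalar sketch of that GRU difference, assuming the formulation I read in the PyTorch GRU documentation, where the hidden-side bias of the candidate state sits inside the product with the reset gate r. The function names and numbers below are made up for illustration and are not PyTorch API:

```python
import math

def gru_candidate(x, h, r, w_in, w_hn, b_in, b_hn):
    """Candidate state with the hidden bias inside the reset-gate product."""
    return math.tanh(w_in * x + b_in + r * (w_hn * h + b_hn))

def gru_candidate_merged(x, h, r, w_in, w_hn, b):
    """Hypothetical variant with one merged bias outside the product."""
    return math.tanh(w_in * x + b + r * (w_hn * h))

# Arbitrary illustrative values; r is the reset gate activation in (0, 1).
x, h, r = 0.5, -0.3, 0.25
w_in, w_hn, b_in, b_hn = 0.8, 1.2, 0.1, 0.4

two = gru_candidate(x, h, r, w_in, w_hn, b_in, b_hn)
merged = gru_candidate_merged(x, h, r, w_in, w_hn, b_in + b_hn)
print(two, merged)  # the two values differ because r scales b_hn

# Only when r == 1 do the two formulations coincide.
two_r1 = gru_candidate(x, h, 1.0, w_in, w_hn, b_in, b_hn)
merged_r1 = gru_candidate_merged(x, h, 1.0, w_in, w_hn, b_in + b_hn)
print(two_r1, merged_r1)
```

So under that assumption the two GRU biases of the candidate state cannot simply be summed into one, unlike in the LSTM case.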

Best regards



@tom you seem to be right, but do you have any idea why it is done that way in cuDNN?

Maybe they define the recurrent layer as the sum of two linear layers (one for the input and another for the state), where each one has its own bias, so the resulting layer ends up having two biases…
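
A rough sketch of that reading, in plain Python with no PyTorch dependency. The helper names, sizes and random parameters are hypothetical, and this only illustrates the "sum of two linear layers" idea, not how cuDNN is actually implemented:

```python
import random

def linear(weight, bias, vec):
    """Affine map weight @ vec + bias on plain Python lists."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]

def rand_vector(n, rng):
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]

def rand_matrix(rows, cols, rng):
    return [rand_vector(cols, rng) for _ in range(rows)]

rng = random.Random(0)
input_size, hidden_size = 3, 2

# One gate modelled as two independent linear layers, each with its own bias.
w_ih, b_ih = rand_matrix(hidden_size, input_size, rng), rand_vector(hidden_size, rng)
w_hh, b_hh = rand_matrix(hidden_size, hidden_size, rng), rand_vector(hidden_size, rng)
x, h = rand_vector(input_size, rng), rand_vector(hidden_size, rng)

# Gate pre-activation as the sum of the input layer and the state layer.
pre_two = [a + b for a, b in zip(linear(w_ih, b_ih, x), linear(w_hh, b_hh, h))]

# Same pre-activation with bias-free layers and a single merged bias.
b_merged = [a + b for a, b in zip(b_ih, b_hh)]
no_bias = [0.0] * hidden_size
pre_one = [a + b + c for a, b, c in
           zip(linear(w_ih, no_bias, x), linear(w_hh, no_bias, h), b_merged)]

max_diff = max(abs(a - b) for a, b in zip(pre_two, pre_one))
print(max_diff)  # zero up to floating-point rounding
```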