You can think of the above as one building block, with additional GRU layers just repeating the same gated algorithm. So whether you give each layer its own module or put them all under one module is a matter of preference. The math works out the same.
I could see there being a benefit to splitting it up for very large models, as it’s much easier to assign each “layer” to different GPUs/TPUs during initialization and training. If you put them all in one nn.GRU unit, it’s going to take a lot more code and effort to assign each layer to separate GPUs.
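As a minimal sketch of that idea (PyTorch assumed; the sizes and device names are made up for illustration, and `cpu` stands in for `cuda:0`/`cuda:1` on a real multi-GPU box), splitting the stack into two single-layer `nn.GRU` modules makes per-layer device placement trivial:

```python
import torch
import torch.nn as nn

# Placeholder devices: on a multi-GPU machine you might use
# torch.device("cuda:0") and torch.device("cuda:1") instead.
dev0 = torch.device("cpu")
dev1 = torch.device("cpu")

# Two single-layer GRUs standing in for one nn.GRU(num_layers=2).
# Note the second layer's input_size equals the first's hidden_size.
gru1 = nn.GRU(input_size=8, hidden_size=16, batch_first=True).to(dev0)
gru2 = nn.GRU(input_size=16, hidden_size=16, batch_first=True).to(dev1)

x = torch.randn(4, 10, 8, device=dev0)  # (batch, seq, features)
out1, h1 = gru1(x)                      # "layer 1" runs on dev0
out2, h2 = gru2(out1.to(dev1))          # "layer 2" runs on dev1
```

With one `nn.GRU(num_layers=2)` unit, all of the layer weights live inside a single module, so this kind of split would require reaching into its internals instead of a one-line `.to(device)` per layer.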
Thanks.
If we use 2 GRUs, the h0 of the second GRU may be initialised with zeros (or different values) instead of receiving the last hidden state of the first GRU,
while when using 1 GRU with multiple layers, the hidden state is passed on to the next layer, across all layers.
Is that something that can change the results dramatically?
Right, it would change the result. You would need to keep track of 2 hidden states, one for each separate unit. Notice how the hidden state contains the number of layers in its dims:
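For example (a quick sketch with made-up sizes), the `h_n` returned by a 2-layer GRU has `num_layers` as its first dimension, i.e. one hidden state per layer:

```python
import torch
import torch.nn as nn

gru = nn.GRU(input_size=8, hidden_size=16, num_layers=2, batch_first=True)
x = torch.randn(4, 10, 8)  # (batch, seq, features)

out, h_n = gru(x)
print(h_n.shape)  # torch.Size([2, 4, 16]) -> (num_layers, batch, hidden_size)
```

Note that `h_n` keeps this `(num_layers, batch, hidden_size)` layout even with `batch_first=True`, which only affects the input/output tensors.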
If you set the hidden state to zeros for the second layer, it would no longer be the equivalent of using multiple layers in one GRU. It would be the same as if you used one GRU and set half of the hidden state to zeros.
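To make the equivalence concrete (a sketch with arbitrary sizes; the per-layer weight attributes `weight_ih_l0`, `weight_hh_l1`, etc. are the standard `nn.GRU` parameter names), two single-layer GRUs reproduce one 2-layer GRU exactly, provided each unit gets its own slice of the stacked hidden state:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
stacked = nn.GRU(input_size=8, hidden_size=16, num_layers=2, batch_first=True)
g1 = nn.GRU(8, 16, num_layers=1, batch_first=True)
g2 = nn.GRU(16, 16, num_layers=1, batch_first=True)

# Copy the stacked GRU's per-layer weights into the two separate units.
with torch.no_grad():
    for name in ["weight_ih", "weight_hh", "bias_ih", "bias_hh"]:
        getattr(g1, name + "_l0").copy_(getattr(stacked, name + "_l0"))
        getattr(g2, name + "_l0").copy_(getattr(stacked, name + "_l1"))

x = torch.randn(4, 10, 8)
h0 = torch.randn(2, 4, 16)  # (num_layers, batch, hidden_size)

out_s, h_s = stacked(x, h0)

# Split version: track one hidden state per unit, slicing h0 by layer.
out_a, h_a = g1(x, h0[0:1])
out_b, h_b = g2(out_a, h0[1:2])

print(torch.allclose(out_s, out_b, atol=1e-6))                     # True
print(torch.allclose(h_s, torch.cat([h_a, h_b]), atol=1e-6))       # True
```

Passing zeros as the second unit's `h0` here would match the stacked GRU only if you also zeroed `h0[1]`, which is the "half of the hidden state" point above.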